
feat: add store workspace import and publish workflows#4

Merged
Snider merged 87 commits into main from dev
Apr 27, 2026

Conversation

@Snider (Contributor) commented Apr 27, 2026

Routine dev→main PR opened by Hephaestus PR-cadence task.

This PR exists to:

  1. Trigger CodeRabbit auto-review of accumulated dev work
  2. Surface any blocking review feedback before merge
  3. Keep main current with dev once approved

ahead_by: 82 commit(s) (per gh api compare)

If CodeRabbit clears with no blocking comments, this PR is eligible for gh pr merge --merge (real merge commit, no force-push). Conflicts and review feedback should be addressed on the dev branch before merge.

Co-authored-by: Hephaestus <hephaestus@cladius>

Summary by CodeRabbit

  • New Features

    • Module identity updated to dappco.re/go/store; DuckDB-backed workspace buffering, inventory and import pipelines; publish-to-HuggingFace tool and Parquet export stubs; WorkspaceStateDirectory config and Workspace.Count().
  • Enhancements

    • Medium-backed import/export and archive publishing; transactional store/scoped-store APIs with namespace/quota semantics; improved compaction reliability; JSON helpers and DuckDB query/utility APIs.
  • Tests & Docs

    • Expanded test suites, new DEPENDENCIES.md and LICENCE.md, README and documentation updates.

Virgil added 30 commits April 4, 2026 15:47
Co-Authored-By: Virgil <virgil@lethean.io>
Align the scoped helper name with the rest of the package and fix the RFC reference paths so the docs point at real local sources.

Co-Authored-By: Virgil <virgil@lethean.io>
Align the RFC text and store comments with the SQLite-backed journal implementation.

Co-Authored-By: Virgil <virgil@lethean.io>
Use more descriptive internal field names in StoreTransaction to better match the AX naming guidance without changing behaviour.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider and others added 13 commits April 24, 2026 19:38
journal.go used strconv.{Atoi,ParseInt,ParseFloat} for parsing numeric
values from journal query results. core/go has no Atoi/ParseInt/
ParseFloat primitives yet, so dropped the strconv import by inlining
parseJournalInt64 and parseJournalFloat64 helpers that return
core.E()-wrapped errors to fit the codebase style.

Follow-up note: a proper core.ParseInt / core.ParseFloat addition
would let these helpers be removed — separate ticket.

Closes tasks.lthn.sh/view.php?id=258

Co-authored-by: Codex <noreply@openai.com>
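The commit above inlines strconv-free parsing helpers that return self-identifying errors. A minimal sketch of what such a helper could look like (the name parseJournalInt64 comes from the commit message; the body, error wording, and the omission of overflow handling are illustrative assumptions, not the actual implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// parseJournalInt64 parses a base-10 signed integer without strconv,
// returning a self-identifying error on malformed input.
// Overflow handling is omitted for brevity in this sketch.
func parseJournalInt64(raw string) (int64, error) {
	if raw == "" {
		return 0, errors.New("store.parseJournalInt64: empty input")
	}
	negative := false
	start := 0
	if raw[0] == '+' || raw[0] == '-' {
		negative = raw[0] == '-'
		start = 1
		if len(raw) == 1 {
			return 0, errors.New("store.parseJournalInt64: sign without digits")
		}
	}
	var value int64
	for _, ch := range raw[start:] {
		if ch < '0' || ch > '9' {
			return 0, fmt.Errorf("store.parseJournalInt64: invalid digit %q in %q", ch, raw)
		}
		value = value*10 + int64(ch-'0')
	}
	if negative {
		value = -value
	}
	return value, nil
}

func main() {
	value, err := parseJournalInt64("-12345")
	fmt.Println(value, err)
}
```

As the follow-up note says, a shared core.ParseInt would make a hand-rolled helper like this unnecessary.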
Added `// Note:` trailers to 5 direct external deps:
- InfluxDB client: time-series storage backend
- klauspost compression: gzip/zstd for cold archive compaction
- modernc.org/sqlite: pure-Go SQLite driver
- DuckDB: workspace buffer analytical queries
- parquet-go: columnar format for journal archives

No core.* equivalents exist for any of them.

Closes tasks.lthn.sh/view.php?id=778

Co-authored-by: Codex <noreply@openai.com>
Updated stale cross-module dep path:
- go.mod: dappco.re/go/core/io v0.1.7 → dappco.re/go/io
- medium.go: import + doc comment rewritten

No stale path remains in .go or go.mod.

Closes tasks.lthn.sh/view.php?id=777

Co-authored-by: Codex <noreply@openai.com>
Added §7 Transaction API + §7.1 ScopedStoreTransaction covering:
- Store.Transaction(fn) commit/rollback semantics
- error/panic propagation behavior
- post-commit event dispatch
- Scope isolation during tx, scoped prefixing, local group names,
  namespace-local delete/purge/count, quota checks, event localization
- Added transaction.go to architecture table; renumbered later sections.

Closes tasks.lthn.sh/view.php?id=598

Co-authored-by: Codex <noreply@openai.com>
…struct

- Removed duplicate `Event` struct definition in §4; kept single canonical
- Added Type + Timestamp fields to Event; added EventType constants
- Expanded Store struct to match store.go: SQLite aliases, purge lifecycle
  fields, journal client/config, medium state, watcher/callback locks,
  orphan workspace cache

Normalized diff vs store.go now has zero differences.

Closes tasks.lthn.sh/view.php?id=599

Co-authored-by: Codex <noreply@openai.com>
- Bump dappco.re/go/* deps to v0.8.0-alpha.1 in go.mod (any forge.lthn.ai/core/* paths migrated to canonical dappco.re/go/* form)

Co-Authored-By: Athena <athena@lthn.ai>
Audit confirmed RFC §9.3 Import + Export functions exist (under canonical
names). Added regression test coverage in import_export_test.go to lock
in the contract:
- Good: CSV/JSON/JSONL ingestion + export round-trip
- Bad: malformed payloads
- Ugly: empty payloads

Race PASS.

Co-authored-by: Codex <noreply@openai.com>
Closes tasks.lthn.sh/view.php?id=260
…dium (#259)

Replaced io.Writer/Reader usage with coreio.Medium write/read calls per
RFC §9.4. gzip/zstd archive output now writes to Medium-backed buffer.
io import removed (compress/gzip retained — not banned).

Race PASS.

Co-authored-by: Codex <noreply@openai.com>
Closes tasks.lthn.sh/view.php?id=259
…261)

RecoverOrphans existed but silently skipped corrupt/unreadable orphan
files. Now corrupt files (incl. -wal and -shm sidecars) are quarantined
under <state>/quarantine/ instead of dropped silently.

Tests recover_test.go _Good (orphan recovered) / _Bad (corrupt
quarantined) / _Ugly (no orphans no-op). Race PASS.

Co-authored-by: Codex <noreply@openai.com>
Closes tasks.lthn.sh/view.php?id=261
json.go: bytes.Buffer → core.NewBuilder, removed bytes. parquet.go:
removed io via local narrow readCloser/writeCloser interfaces for
existing stream assertions.

Co-authored-by: Codex <noreply@openai.com>
medium.go: removed bytes, replaced bytes.Buffer with core.NewBuffer in
CSV parser. publish.go: removed bytes, replaced bytes.NewReader with
core.NewBuffer.

Co-authored-by: Codex <noreply@openai.com>
# Conflicts:
#	docs/RFC-STORE.md

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Module path changed to dappco.re/go/store. Adds DuckDB-backed workspace buffering, DuckDB import tooling, medium-backed import/export and archive routing, HuggingFace publishing, transactional/scoped APIs with quota enforcement, numerous docs updates, and replaces testify with local test assertion helpers.

Changes

  • Module & Top-level Docs (go.mod, CLAUDE.md, README.md, DEPENDENCIES.md, LICENCE.md, docs/..., docs/RFC-STORE.md): Module path changed to dappco.re/go/store; dependency set reworked; new DEPENDENCIES and LICENCE files; docs updated to reference API renames (e.g., NewScopedConfigured) and WorkspaceStateDirectory.
  • Core Store & Configuration (store.go, doc.go, compact.go, compact_test.go): StoreConfig now normalises via Normalised(), Store options mutate *Store, StoreConfig gains Medium; compaction routes output through a Medium, adds CompactOptions.Normalised()/Validate(), and defers publish until tx commit.
  • Lifecycle / Events / Concurrency (events.go, events_test.go): Lifecycle state/locks renamed (lifecycleLock, isClosed/isClosing); watcher/callback registration and notify flow refactored; tests migrated to local assertion helpers.
  • Transactions & Scoped API (transaction.go, transaction_test.go, scope.go, scope_test.go): Introduces transaction-centric API (Store.Transaction), adds Exists/GroupExists/PurgeExpired on transactions; scoped APIs now use NewScopedConfigured, introduce quota validation/enforcement and namespace-local watcher bridges.
  • Medium I/O, Import & Export (medium.go, import.go, import_export_test.go, medium_test.go): Adds Medium alias (coreio.Medium), WithMedium, Import/Export for JSON/JSONL/CSV and ImportAll for DuckDB pipelines; extensive medium-backed tests and an in-memory medium test double.
  • DuckDB & Inventory (duckdb.go, inventory.go, import.go, import_test.go): New DuckDB type and query helpers, scoring tables, TableCounts, inventory printing and DuckDB import helpers; associated tests added.
  • Workspace Buffering & Recovery (workspace.go, workspace_test.go, recover_test.go): Workspace buffer migrated to DuckDB (db field), sequence-backed entry_id, BIGINT timestamps, orphan aggregation/quarantine and recovery logic; new Workspace.Count().
  • Journal & Query Parsing (journal.go, journal_test.go): Flux equality parsing refactored into typed (string vs scalar) filters, stricter time parsing, predicate generation updated; tests added for numeric/boolean filters.
  • Publishing (publish.go, publish_test.go): Adds PublishConfig and Publish() to publish dataset artifacts to HuggingFace, with token resolution, dry-run mode and HTTP upload flow; tests for validation and token resolution.
  • JSON & Parquet Helpers (json.go, parquet.go): Adds RawMessage and MarshalIndent JSON helpers; Parquet export entrypoints stubbed to avoid adding a Parquet dependency.
  • Medium-backed Compact & Tests (compact.go, compact_test.go, medium_test.go): Compaction now writes archives via Medium (staged .tmp then publish), options normalisation/validation added; tests updated to use medium and local asserts.
  • Tests & Assertion Helpers (test_asserts_test.go, test_helpers_test.go, conventions_test.go, coverage_test.go, *test.go): Removes testify usage across tests in favour of local assert* helpers; many tests updated to ignore Close() errors in deferred cleanup.
  • Scope / Quota Enforcement (scope.go, scope_test.go): Scoped store readiness refactor (ensureReady()), ScopedStoreConfig.Validate() includes quota validation, live quota enforcement implemented, namespaced purge and scoped transaction adaptations.
  • Store Lifecycle & Medium Sync (store.go, related close paths): Close lifecycle refactored (isClosed/isClosing, lifecycleLock); medium-backed SQLite persistence on open/close (seed from medium, sync back on Close).
  • Utilities, CI & Tooling (.golangci.yml, .github/workflows/ci.yml, tests/cli/store/Taskfile.yaml, path_test.go): GolangCI config updated, CI sets CGO_ENABLED=1, Taskfile added for CLI tests; minor test fixes and tooling updates.
  • New Exports / Small API Additions (store.go, transaction.go, scope.go, medium.go, duckdb.go): New exported types/methods: StoreConfig.Normalised(), JournalConfiguration.Validate(), Store.Exists/GroupExists, Store.Medium(), DuckDB API, transaction/scoped existence & purge methods, and medium-related functions.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 15.43%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the title 'feat: add store workspace import and publish workflows' accurately describes the main changes, which introduce import/publish APIs and supporting workspace infrastructure across multiple files.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai Bot left a comment


Actionable comments posted: 17

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
workspace.go (1)

470-472: ⚠️ Potential issue | 🟠 Major

DuckDB sidecars are still being handled with SQLite suffixes.

After the move to .duckdb files, these paths still target -wal/-shm. DuckDB uses *.duckdb.wal, so Discard()/Commit() can leak WAL files and corrupt-orphan quarantine will leave the companion WAL behind.

Also applies to: 595-596

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workspace.go` around lines 470 - 472, The cleanup currently appends SQLite
suffixes ("-wal", "-shm") to workspace.databasePath when deleting sidecars;
detect DuckDB files (when workspace.databasePath endsWith ".duckdb") and instead
target DuckDB sidecars (append ".wal" to produce "<name>.duckdb.wal" and skip
the "-shm" variant), using workspace.filesystem.Delete and
workspace.filesystem.Exists accordingly; apply the same branching change in the
other cleanup sites referenced (the code paths invoked by closeAndCleanup as
well as Discard() and Commit()) so DuckDB WAL files are removed correctly.
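The suggested fix branches on the database engine when building sidecar paths. A small stdlib-only sketch of that branching (the helper name sidecarPaths is an assumption; the suffix rules follow the review comment: SQLite uses "<db>-wal"/"<db>-shm", DuckDB uses "<db>.duckdb.wal"):

```go
package main

import (
	"fmt"
	"strings"
)

// sidecarPaths returns the WAL/SHM companion files to clean up for a
// given database path. DuckDB databases have a single "<name>.duckdb.wal"
// sidecar; SQLite databases have "<name>-wal" and "<name>-shm".
func sidecarPaths(databasePath string) []string {
	if strings.HasSuffix(databasePath, ".duckdb") {
		return []string{databasePath + ".wal"}
	}
	return []string{databasePath + "-wal", databasePath + "-shm"}
}

func main() {
	fmt.Println(sidecarPaths("ws/buffer.duckdb"))
	fmt.Println(sidecarPaths("ws/buffer.sqlite"))
}
```

Each cleanup site (Discard(), Commit(), the quarantine path) would iterate the returned slice and delete whatever exists.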
🟡 Minor comments (6)
duckdb.go-444-449 (1)

444-449: ⚠️ Potential issue | 🟡 Minor

Wrap scoring write failures with method context.

Returning the raw driver error loses the operation name and breaks the repository's public error-format convention.

As per coding guidelines, "**/*.go: Error strings should follow format: \"store.Method: what failed\" — self-identifying without stack traces".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@duckdb.go` around lines 444 - 449, The WriteScoringResult method returns the
raw driver error from db.conn.Exec which violates the project's error-format
convention; update WriteScoringResult to wrap the returned error with a
self-identifying message like "store.WriteScoringResult: insert scoring result
failed" (using fmt.Errorf and %w) so failures include the method context while
keeping the original error, and make this change around the db.conn.Exec call in
the WriteScoringResult function.
README.md-1-2 (1)

1-2: ⚠️ Potential issue | 🟡 Minor

Keep the badge text in UK English as well.

The badge label still says License, which is inconsistent with the rest of the repo's Licence spelling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 1 - 2, Update the badge label from US to UK spelling:
change the alt text "[![License: EUPL-1.2]" to "[![Licence: EUPL-1.2]" in
README.md (the markdown badge line containing "[![License:
EUPL-1.2](https://img.shields.io...)](LICENSE.md)"); leave the image URL and
target link (LICENSE.md) intact unless you also intend to rename the LICENSE
file—only the visible badge text should be switched to "Licence".
events.go-30-30 (1)

30-30: ⚠️ Potential issue | 🟡 Minor

Call out the EventDeleteGroup.String() wire-value change explicitly.

Changing from delete_group to deletegroup is externally observable; downstream filters/parsers may break silently without release-note/migration callouts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@events.go` at line 30, The EventDeleteGroup.String() implementation changed
the wire value from "delete_group" to "deletegroup", which is a breaking
external change; revert the returned string in EventDeleteGroup.String() back to
"delete_group" to preserve the original wire-format or, if intentional, add an
explicit migration/release-note and update any parsing/filtering code that
depends on the old value; reference EventDeleteGroup.String() and ensure tests
that assert its wire output are updated accordingly.
docs/architecture.md-193-193 (1)

193-193: ⚠️ Potential issue | 🟡 Minor

Fix the NewScopedConfigured signature in prose.

Both changed sentences omit the first *Store argument and currently describe a non-callable form.

✏️ Proposed docs patch
-`ScopedStore` wraps a `*Store` and automatically prefixes all group names with `namespace + ":"`. This prevents key collisions when multiple tenants share a single underlying database. When the namespace and quota are already known, prefer `NewScopedConfigured(store.ScopedStoreConfig{...})` so the configuration is explicit at the call site.
+`ScopedStore` wraps a `*Store` and automatically prefixes all group names with `namespace + ":"`. This prevents key collisions when multiple tenants share a single underlying database. When the namespace and quota are already known, prefer `NewScopedConfigured(storeInstance, store.ScopedStoreConfig{...})` so the configuration is explicit at the call site.
@@
-`NewScopedConfigured(store.ScopedStoreConfig{...})` is the preferred way to set per-namespace limits because the quota values stay visible at the call site. For example, `store.QuotaConfig{MaxKeys: 100, MaxGroups: 10}` caps a namespace at 100 keys and 10 groups:
+`NewScopedConfigured(storeInstance, store.ScopedStoreConfig{...})` is the preferred way to set per-namespace limits because the quota values stay visible at the call site. For example, `store.QuotaConfig{MaxKeys: 100, MaxGroups: 10}` caps a namespace at 100 keys and 10 groups:

Also applies to: 218-218

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture.md` at line 193, The prose incorrectly omits the first
*Store argument in the `NewScopedConfigured` signature; update the sentences to
show the full callable form including the `*Store` parameter (e.g.,
`NewScopedConfigured(*Store, store.ScopedStoreConfig{...})`) and adjust the
surrounding text to refer to passing the store plus config so it reads as a
function call; ensure the same fix is applied at the second occurrence
referencing `NewScopedConfigured`.
doc.go-7-8 (1)

7-8: ⚠️ Potential issue | 🟡 Minor

Correct the documented NewScopedConfigured call shape.

The prose currently shows store.NewScopedConfigured(store.ScopedStoreConfig{...}), but the API requires the parent store argument first.

✏️ Proposed docs patch
-// Use `store.NewScopedConfigured(store.ScopedStoreConfig{...})` when the
+// Use `store.NewScopedConfigured(configuredStore, store.ScopedStoreConfig{...})` when the
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@doc.go` around lines 7 - 8, The docs show the wrong call shape for
NewScopedConfigured; update examples to pass the parent store first and the
config second by using NewScopedConfigured(parentStore,
store.ScopedStoreConfig{...}) (referencing NewScopedConfigured and
ScopedStoreConfig) so the API signature matches the implementation.
scope_test.go-1340-1342 (1)

1340-1342: ⚠️ Potential issue | 🟡 Minor

keyName wraps after 26 inputs and weakens the quota tests.

string(rune('a' + i%26)) repeats values after i == 25, so the tests that claim to create 50 or 100 unique groups/keys only exercise 26 uniques. That can hide quota regressions.

Suggested fix
 func keyName(i int) string {
-	return "key-" + string(rune('a'+i%26))
+	return core.Sprintf("key-%d", i)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scope_test.go` around lines 1340 - 1342, The keyName function currently wraps
after 26 values because it converts i to a single letter via
string(rune('a'+i%26)), weakening quota tests; update keyName (the keyName
function) to produce a unique name per integer (e.g., append the full decimal or
a non-wrapping encoding of i such as strconv.Itoa(i) or fmt.Sprintf("%d", i)) so
that calls with i up to 50/100 produce distinct keys; replace the rune-based
conversion with a stable integer-to-string conversion to ensure uniqueness
across all test inputs.
🧹 Nitpick comments (8)
tests/cli/store/Taskfile.yaml (1)

3-5: Consider adding a vet task for static analysis.

Based on learnings, the project verification workflow includes go vet ./.... Adding a dedicated vet task would improve developer automation and ensure consistent static analysis checks.

🔧 Proposed addition of vet task
   default:
-    deps: [build, test]
+    deps: [build, vet, test]
 
   build:
     dir: ../../..
     cmds:
       - go build ./...
 
+  vet:
+    dir: ../../..
+    cmds:
+      - go vet ./...
+
   test:
     dir: ../../..
     cmds:
       - go test -count=1 -race ./...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/cli/store/Taskfile.yaml` around lines 3 - 5, The Taskfile currently
defines a default task with deps [build, test] but lacks a vet step; add a new
task named "vet" that runs the static analysis command (e.g., "go vet ./...")
and include "vet" as a dependency in the "default" task's deps array so the
default workflow runs static analysis automatically; locate the "tasks" block
and modify the "default" task and add a new "vet" task entry accordingly.
json.go (2)

47-47: Prefer a descriptive parameter name in exported API.

v is terse for a public helper; value is clearer.

As per coding guidelines: "Prefer descriptive names over abbreviations in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@json.go` at line 47, Rename the terse parameter v to a descriptive name like
value in the exported function MarshalIndent to improve API clarity; update the
function signature (func MarshalIndent(value any, prefix, indent string)
([]byte, error)) and every internal and external call site that references
MarshalIndent, plus any associated docs/comments and tests that mention the old
parameter name, ensuring compilation and behavior remain unchanged.

14-18: Use concrete values in public usage examples.

These examples are still placeholder-style (raw, entry, compact) rather than copy-pasteable snippets with real values.

As per coding guidelines: "Public comments should show real usage with concrete values".

Also applies to: 23-23, 33-33, 46-46, 73-73

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@json.go` around lines 14 - 18, Replace the placeholder-style example values
in the public comments with concrete, copy-pastable snippets: for the CacheEntry
example (type CacheEntry struct { Data store.RawMessage `json:"data"` }), and
the other occurrences noted, use real JSON byte literals and realistic variable
names (e.g., jsonData := []byte(`{"name":"Alice"}`), cacheEntry :=
CacheEntry{Data: jsonData}, compactJSON := []byte(`{"name":"Alice"}`)) so the
examples compile and demonstrate actual usage of store.RawMessage and
CacheEntry; update all instances called out (the examples at the five locations)
to follow this concrete pattern.
journal_test.go (1)

190-190: Minor: Unusual whitespace in function call.

The commitJournalEntry call has atypical internal spacing. This appears to be a formatting artifact.

🧹 Suggested formatting fix
-	assertNoError(t, commitJournalEntry( storeInstance.sqliteDatabase, "events", "session-b", `{"like":2}`, `{"workspace":"session-b"}`, time.Now().UnixMilli(), ))
+	assertNoError(t, commitJournalEntry(storeInstance.sqliteDatabase, "events", "session-b", `{"like":2}`, `{"workspace":"session-b"}`, time.Now().UnixMilli()))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@journal_test.go` at line 190, The commitJournalEntry invocation has odd
internal spacing/line breaks; normalize the call to standard formatting by
removing the extra spaces and trailing comma inside the argument list so the
call reads as a normal single-line (or consistently indented multi-line)
function call—locate the call to commitJournalEntry (using
storeInstance.sqliteDatabase, "events", "session-b", `{"like":2}`,
`{"workspace":"session-b"}`, time.Now().UnixMilli()) and adjust
whitespace/commas so arguments are cleanly separated and consistent with project
formatting.
test_asserts_test.go (1)

210-222: Consider handling multi-error unwrapping.

The errIs function handles single-error unwrapping via Unwrap() error, but Go 1.20+ introduced Unwrap() []error for joined errors. If tests need to match errors wrapped with errors.Join, this implementation won't traverse them.

For test utilities this is likely acceptable, but worth noting if you use errors.Join elsewhere.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test_asserts_test.go` around lines 210 - 222, The errIs helper currently only
follows single-error wrappers via the Unwrap() error pattern; update errIs to
also detect and handle multi-error unwrapping (interface Unwrap() []error)
introduced in Go1.20 so it can traverse errors created by errors.Join.
Specifically, in errIs check for the multi-unwrapper type (Unwrap() []error) and
recursively inspect each error in the returned slice (falling back to the
existing single Unwrap() error path) until a match with the target is found;
keep the function name errIs and preserve the existing single-error logic while
adding the multi-error branch.
import_export_test.go (1)

8-9: Inconsistent testify usage across test suite.

This file imports and uses testify/assert and testify/require, whilst other test files in this PR (journal_test.go, workspace_test.go, recover_test.go, path_test.go, test_helpers_test.go) have migrated to the local assertion helpers defined in test_asserts_test.go.

Consider migrating this file to use the same local helpers for consistency.

♻️ Proposed migration to local helpers
 import (
 	"testing"
-
-	"github.com/stretchr/testify/assert"
-	"github.com/stretchr/testify/require"
 )

Then replace:

  • require.NoError(t, err) → assertNoError(t, err)
  • require.Error(t, ...) → assertError(t, ...)
  • assert.Equal(t, want, got) → assertEqual(t, want, got)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@import_export_test.go` around lines 8 - 9, This test imports
github.com/stretchr/testify/assert and require; replace those imports with the
local test helpers (from test_asserts_test.go) and migrate usages: change
require.NoError(t, err) to assertNoError(t, err), require.Error(t, ...) to
assertError(t, ...), and assert.Equal(t, want, got) to assertEqual(t, want,
got); ensure any other testify calls are mapped to the corresponding local
helpers and remove the testify import lines so the file uses only the local
assertion helpers.
workspace.go (1)

42-47: Collapse db and sqliteDatabase into a single canonical field.

These now represent the same handle, and ensureReady already has to repair drift between them. Keeping both makes lifecycle bugs much easier to introduce during the DuckDB migration. Based on learnings: Do not add compatibility aliases; the primary API names are the contract.

Also applies to: 83-91

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workspace.go` around lines 42 - 47, The Workspace struct currently has two
fields db and sqliteDatabase that represent the same SQL handle; collapse them
into a single canonical field (choose one name, e.g., db) by removing
sqliteDatabase from the struct and updating all references and assignments
(including in ensureReady, constructors, Close/cleanup methods, and any methods
that read/write sqliteDatabase) to use the single canonical field; remove any
compatibility aliases or duplicate maintenance logic and ensure ensureReady
still reconciles any drift but only writes to the canonical field (Workspace.db)
and that all call sites (methods, tests, initializers) are updated to the chosen
field name.
medium.go (1)

234-262: Remove unused wasEscaped variable.

The wasEscaped variable is assigned at lines 248 and 254 but only referenced to silence the compiler at line 260. This is dead code.

🧹 Proposed fix to remove dead code
 func splitCSVLine(line string) []string {
 	line = trimTrailingCarriageReturn(line)
 	buffer := core.NewBuffer()
 	var (
-		fields     []string
-		inQuotes   bool
-		wasEscaped bool
+		fields   []string
+		inQuotes bool
 	)
 	for index := 0; index < len(line); index++ {
 		character := line[index]
 		switch {
 		case character == '"' && inQuotes && index+1 < len(line) && line[index+1] == '"':
 			buffer.WriteByte('"')
 			index++
-			wasEscaped = true
 		case character == '"':
 			inQuotes = !inQuotes
 		case character == ',' && !inQuotes:
 			fields = append(fields, buffer.String())
 			buffer.Reset()
-			wasEscaped = false
 		default:
 			buffer.WriteByte(character)
 		}
 	}
 	fields = append(fields, buffer.String())
-	_ = wasEscaped
 	return fields
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@medium.go` around lines 234 - 262, The variable wasEscaped in splitCSVLine is
unused/dead code; remove its declaration and all assignments to it (the
assignments inside the escaped-quote case and after appending a field) and
delete the final "_ = wasEscaped" discard line so the function only uses buffer,
fields and inQuotes, preserving current CSV parsing behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compact.go`:
- Around line 180-181: The code currently calls medium.Write(outputPath,
compressedArchive) before the subsequent DB transaction (Begin/Exec/Commit),
which can leave an archive file if the DB commit later fails; modify the flow in
the compact.go routine (the path using medium.Write, outputPath and
compressedArchive) to write the archive to a temporary/staging path first (e.g.,
outputPath+".tmp" or a unique temp filename), then perform the DB transaction
(Begin/Exec/Commit) to insert journal/archive metadata, and only after a
successful Commit atomically rename/publish the staged file to outputPath (or
otherwise make the external write idempotent); ensure cleanup of the temp file
on failure.
- Around line 262-269: compactArchiveBuffer.Write currently rereads the whole
value via medium.Read and rewrites it on every call (using buffer.path), causing
quadratic behavior; change it to buffer writes to an in-memory streaming
accumulator (e.g., add a bytes.Buffer field on compactArchiveBuffer or an
io.WriteCloser stream) and append new bytes to that buffer instead of calling
medium.Read each time, then flush once to medium.Write (or use a streaming
append method on medium if available) so Write(data []byte) simply does
buffer.buf.Write(data) and only performs medium.Write(buffer.buf.Bytes()) on
Close/Flush.

In `@conventions_test.go`:
- Line 42: Replace the custom test helper calls with testify's assert/require
helpers and add the testify imports; specifically swap calls like
assertEmptyf(t, ...) to assert.Emptyf(t, ...) (use assert.* for verifications)
and ensure precondition checks use require.* (e.g.,
require.NoErrorf/require.NotNilf where appropriate), and add
"github.com/stretchr/testify/assert" and "github.com/stretchr/testify/require"
to the test imports so conventions_test.go (and the other listed occurrences)
use the repository's standard testify helpers.

In `@duckdb.go`:
- Around line 419-437: EnsureScoringTables currently discards errors from the
three CREATE TABLE Exec calls; change its signature to return an error (func (db
*DuckDB) EnsureScoringTables() error) and propagate any Exec error instead of
ignoring it: call db.conn.Exec(...) for each of the three DDL statements (the
ones using TableCheckpointScores, TableProbeResults, and the "scoring_results"
literal), check the returned error after each Exec, and return the first non-nil
error (or wrap with context) so callers know schema creation failed.
- Around line 8-10: Replace the non-pure-Go, CGO-using driver import
"github.com/marcboeker/go-duckdb" with the approved pure-Go driver
"modernc.org/sqlite" (remove any CGO requirement and ensure imports/DB driver
name usage point to modernc.org/sqlite); in EnsureScoringTables (function
EnsureScoringTables) stop discarding Exec() results — check each Exec() error
and return a wrapped error using core.E (e.g.,
core.E("store.EnsureScoringTables: failed to execute DDL", err)) instead of
ignoring it; and in WriteScoringResult (function WriteScoringResult) wrap any
returned driver errors with core.E (e.g., core.E("store.WriteScoringResult: what
failed", err)) before returning so callers get consistent wrapped errors.

In `@events.go`:
- Around line 86-88: The checks of storeInstance.isClosed must be performed
under lifecycleLock before acquiring any secondary locks to preserve lock
ordering and avoid deadlock: in Watch(), OnChange(), and notify() move the
closed-state check out of the watcherLock / callbackLock / watcherLock.RLock()
sections so you first acquire storeInstance.lifecycleLock, read isClosed, then
release lifecycleLock and only after that take watcherLock/callbackLock (or,
alternatively, acquire lifecycleLock, check isClosed and then while still
holding lifecycleLock acquire the secondary lock if needed); ensure
Store.Close() still acquires lifecycleLock before watcherLock/callbackLock and
update code paths referencing lifecycleLock, watcherLock, callbackLock, Watch,
OnChange, notify and isClosed accordingly.
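The lock-ordering fix above can be sketched as follows. Field and method names mirror the review comment but are assumptions about the real type; the point is that `Watch` reads `isClosed` under `lifecycleLock` and releases it before taking `watcherLock`, matching the order `Close` acquires them in:

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	lifecycleLock sync.Mutex
	watcherLock   sync.Mutex
	isClosed      bool
	watchers      []chan struct{}
}

// Watch checks the closed state under lifecycleLock first, then takes
// watcherLock, so it never holds the locks in the opposite order to Close.
func (s *store) Watch() (chan struct{}, error) {
	s.lifecycleLock.Lock()
	closed := s.isClosed
	s.lifecycleLock.Unlock() // release before taking the secondary lock
	if closed {
		return nil, fmt.Errorf("store.Watch: store is closed")
	}
	s.watcherLock.Lock()
	defer s.watcherLock.Unlock()
	watcher := make(chan struct{}, 1)
	s.watchers = append(s.watchers, watcher)
	return watcher, nil
}

// Close acquires lifecycleLock before watcherLock, the canonical order.
func (s *store) Close() {
	s.lifecycleLock.Lock()
	defer s.lifecycleLock.Unlock()
	s.isClosed = true
	s.watcherLock.Lock()
	defer s.watcherLock.Unlock()
	for _, watcher := range s.watchers {
		close(watcher)
	}
	s.watchers = nil
}

func main() {
	s := &store{}
	s.Watch()
	s.Close()
	_, err := s.Watch()
	fmt.Println(err)
}
```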

In `@go.mod`:
- Line 42: Remove the CGO-dependent module github.com/marcboeker/go-duckdb from
go.mod and replace or justify it: delete the dependency line for
github.com/marcboeker/go-duckdb, update any code importing that module (search
for imports referencing github.com/marcboeker/go-duckdb or package names it
provides) to either use a pure-Go alternative or refactor to avoid DuckDB, and
if keeping it is unavoidable add a clear justification and exception note in the
repository README/DEPENDENCIES explaining the CGO requirement and EUPL-1.2
implications; ensure go mod tidy yields no leftover references.

In `@import.go`:
- Around line 143-145: The code is duplicating "training" in constructed paths
(e.g., trainingLocal := core.JoinPath(cfg.DataDir, "training") plus rel which
already contains "training/..."), causing wrong directories like
<DataDir>/training/training/...; to fix, stop pre-pending the "training" segment
when joining with rel—use core.JoinPath(cfg.DataDir, rel) (or compute
trainingLocal from rel's parent only) and ensure directories are created with
the resulting path before calling localFs.EnsureDir; update the other
occurrences referenced (around lines using trainingLocal, rel, core.JoinPath and
localFs.EnsureDir at the mentioned blocks) so all joins consistently use rel
rather than adding an extra "training" prefix.
- Around line 233-240: The remote SCP path for standalone benchmark files uses
the wrong directory name ("benchmark") causing fetches to fail; update the
core.Sprintf call that builds `remote` in the loop over `bfile` (where `local`,
`cfg.SkipM3`, `cfg.Scp`, and `m3Host` are used) to use
"/Volumes/Data/lem/benchmarks/%s.jsonl" (plural "benchmarks") so it matches the
earlier benchmark pulls and allows the fallback SCP to succeed.
- Around line 205-211: The code creates localSub with localFs.EnsureDir but then
calls cfg.ScpDir into the parent benchLocal directory, so the copied files may
not end up in benchLocal/<subdir> where the importer scans; update the call site
that invokes cfg.ScpDir (inside the loop over subdir) to use localSub as the
target directory (e.g., pass localSub or localSub + "/" as the destination)
instead of core.JoinPath(benchLocal)+"/", keeping the rest of the logic (remote
construction using m3Host and core.Sprintf) unchanged.
- Around line 158-169: The code is ignoring errors from DB operations (db.Exec
and INSERTs) and still increments the import counter and returns nil; update the
DB calls used in ImportAll (and the helper loops that increment count) to check
and handle errors: capture the error returned by db.Exec and by each
prepared/Exec/Query call, only increment the per-type and total count on
success, and return or accumulate an error (or at least mark a non-zero error
state) so ImportAll returns a non-nil error when any CREATE/INSERT fails; update
the blocks around db.Exec(...) and the loops that increment count (the helper
that updates 'count') to branch on err != nil and log/propagate the error
instead of discarding it.

In `@json.go`:
- Around line 82-87: The helper closure writeNewlineIndent inside
indentCompactJSON and all buf.WriteByte / buf.WriteString calls currently ignore
returned errors and may swallow write failures; modify writeNewlineIndent to
return an error and propagate it to the caller, change all invocations to check
and return any non-nil error, and update each WriteByte/WriteString call in
indentCompactJSON to capture and handle their error returns (returning the error
from indentCompactJSON) so the function properly reports write failures instead
of succeeding after a failed write.

In `@parquet.go`:
- Around line 125-166: The code currently accumulates all parsed ParquetRow
instances in the rows slice inside the scanner loop (using ParquetRow, rows and
the scanner.Scan() block), causing OOM on large splits; instead, create/use the
Parquet writer (the writer you use elsewhere to write ParquetRow) and write each
parsed row immediately after building it (after constructing
Prompt/Response/System/Messages from core.JSONUnmarshal and
core.JSONMarshalString) rather than appending to rows, and optionally
flush/close the writer or call its periodic Flush/WriteRow method (depending on
your writer API) to keep memory constant; update both the loop shown and the
similar block at lines 185-203 to stream rows directly to the Parquet writer.
- Around line 8-10: The new import github.com/parquet-go/parquet-go in
parquet.go introduces a disallowed runtime dependency (and an incompatible
license); remove the Parquet runtime dependency from the core package by either
deleting the Parquet export code from parquet.go or moving all Parquet-specific
code and its imports out of the core package into a separate command/tool
package (e.g., cmd/parquet-export) or a build-tagged file so core no longer
imports github.com/parquet-go/parquet-go; ensure any public APIs used by the
tool are exposed only via lightweight interfaces in core (no direct parquet-go
import) and update imports to reference the new tool package instead.

In `@publish.go`:
- Around line 167-183: The code currently calls localFs.Read and converts the
result into a []byte (raw) before creating the HTTP request, which loads the
entire file into memory; change uploadFileToHF to open and stream the file
instead of reading it all at once by replacing localFs.Read(...) and the raw
conversion with a reader (e.g., use localFs.Open or equivalent to obtain an
io.Reader/stream) and pass that reader directly into http.NewRequest so the
request body is streamed; if possible obtain and set Content-Length from the
file metadata before sending to avoid chunked encoding and keep existing error
handling around http.NewRequest, req.Header, and client.Do intact.
- Around line 34-39: PublishConfig.Public is never applied in the real publish
path: update the code that performs the actual create-or-update of the dataset
(the real publish path that mirrors the dry-run) to use PublishConfig.Public
when setting dataset visibility; specifically, read cfg.Public from the
PublishConfig and pass it into the dataset creation/update call (or set the
dataset.Visibility/IsPublic property) so that callers with Public: true or false
get corresponding behavior; ensure the same visibility logic used in the dry-run
printout and in the actual createDataset/updateDataset (or publishDataset) calls
is unified and applied in both paths.

In `@scope.go`:
- Around line 561-575: Add the same readiness guard used by other ScopedStore
methods: if scopedStore.store == nil return an error instead of dereferencing
it. In ScopedStore.Transaction, after checking operation and scopedStore != nil,
add a check `if scopedStore.store == nil { return
core.E("store.ScopedStore.Transaction", "scoped store store is nil", nil) }` so
the method returns the standard error instead of panicking when ScopedStore is a
non-nil zero value; then proceed to call scopedStore.store.Transaction as
before.
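The guard above is small enough to sketch in full; the types are stand-ins for the real `Store`/`ScopedStore` and the error construction stands in for the project's `core.E` helper:

```go
package main

import (
	"errors"
	"fmt"
)

type Store struct{}

func (s *Store) Transaction(operation func() error) error { return operation() }

type ScopedStore struct{ store *Store }

// Transaction returns a standard-shaped error instead of panicking
// when the ScopedStore is a non-nil zero value with a nil inner store.
func (scopedStore *ScopedStore) Transaction(operation func() error) error {
	if operation == nil {
		return errors.New("store.ScopedStore.Transaction: operation is nil")
	}
	if scopedStore == nil || scopedStore.store == nil {
		return errors.New("store.ScopedStore.Transaction: scoped store store is nil")
	}
	return scopedStore.store.Transaction(operation)
}

func main() {
	var zero ScopedStore // non-nil pointer to a zero value
	fmt.Println((&zero).Transaction(func() error { return nil }))
}
```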

---

Outside diff comments:
In `@workspace.go`:
- Around line 470-472: The cleanup currently appends SQLite suffixes ("-wal",
"-shm") to workspace.databasePath when deleting sidecars; detect DuckDB files
(when workspace.databasePath endsWith ".duckdb") and instead target DuckDB
sidecars (append ".wal" to produce "<name>.duckdb.wal" and skip the "-shm"
variant), using workspace.filesystem.Delete and workspace.filesystem.Exists
accordingly; apply the same branching change in the other cleanup sites
referenced (the code paths invoked by closeAndCleanup as well as Discard() and
Commit()) so DuckDB WAL files are removed correctly.
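The sidecar branching can be isolated into one helper, sketched below; the suffix rules come straight from the comment (SQLite uses `-wal`/`-shm`, DuckDB uses `<name>.duckdb.wal` with no `-shm` variant), while the helper name is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// sidecarPaths returns the sidecar files to delete alongside a database:
// DuckDB files get a single ".wal" sidecar; SQLite files get "-wal" and "-shm".
func sidecarPaths(databasePath string) []string {
	if strings.HasSuffix(databasePath, ".duckdb") {
		return []string{databasePath + ".wal"}
	}
	return []string{databasePath + "-wal", databasePath + "-shm"}
}

func main() {
	fmt.Println(sidecarPaths("ws/session.duckdb"))
	fmt.Println(sidecarPaths("ws/store.db"))
}
```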

---

Minor comments:
In `@doc.go`:
- Around line 7-8: The docs show the wrong call shape for NewScopedConfigured;
update examples to pass the parent store first and the config second by using
NewScopedConfigured(parentStore, store.ScopedStoreConfig{...}) (referencing
NewScopedConfigured and ScopedStoreConfig) so the API signature matches the
implementation.

In `@docs/architecture.md`:
- Line 193: The prose incorrectly omits the first *Store argument in the
`NewScopedConfigured` signature; update the sentences to show the full callable
form including the `*Store` parameter (e.g., `NewScopedConfigured(*Store,
store.ScopedStoreConfig{...})`) and adjust the surrounding text to refer to
passing the store plus config so it reads as a function call; ensure the same
fix is applied at the second occurrence referencing `NewScopedConfigured`.

In `@duckdb.go`:
- Around line 444-449: The WriteScoringResult method returns the raw driver
error from db.conn.Exec which violates the project's error-format convention;
update WriteScoringResult to wrap the returned error with a self-identifying
message like "store.WriteScoringResult: insert scoring result failed" (using
fmt.Errorf and %w) so failures include the method context while keeping the
original error, and make this change around the db.conn.Exec call in the
WriteScoringResult function.

In `@events.go`:
- Line 30: The EventDeleteGroup.String() implementation changed the wire value
from "delete_group" to "deletegroup", which is a breaking external change;
revert the returned string in EventDeleteGroup.String() back to "delete_group"
to preserve the original wire-format or, if intentional, add an explicit
migration/release-note and update any parsing/filtering code that depends on the
old value; reference EventDeleteGroup.String() and ensure tests that assert its
wire output are updated accordingly.

In `@README.md`:
- Around line 1-2: Update the badge label from US to UK spelling: change the alt
text "[![License: EUPL-1.2]" to "[![Licence: EUPL-1.2]" in README.md (the
markdown badge line containing "[![License:
EUPL-1.2](https://img.shields.io...)](LICENSE.md)"); leave the image URL and
target link (LICENSE.md) intact unless you also intend to rename the LICENSE
file—only the visible badge text should be switched to "Licence".

In `@scope_test.go`:
- Around line 1340-1342: The keyName function currently wraps after 26 values
because it converts i to a single letter via string(rune('a'+i%26)), weakening
quota tests; update keyName (the keyName function) to produce a unique name per
integer (e.g., append the full decimal or a non-wrapping encoding of i such as
strconv.Itoa(i) or fmt.Sprintf("%d", i)) so that calls with i up to 50/100
produce distinct keys; replace the rune-based conversion with a stable
integer-to-string conversion to ensure uniqueness across all test inputs.
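The non-wrapping key generator is a one-liner; the prefix here is illustrative:

```go
package main

import (
	"fmt"
	"strconv"
)

// keyName appends a decimal suffix so every generated key is unique,
// where the old rune-based version wrapped after 26 values.
func keyName(i int) string {
	return "key-" + strconv.Itoa(i)
}

func main() {
	seen := map[string]bool{}
	for i := 0; i < 100; i++ {
		seen[keyName(i)] = true
	}
	fmt.Println(len(seen)) // 100 distinct keys
}
```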

---

Nitpick comments:
In `@import_export_test.go`:
- Around line 8-9: This test imports github.com/stretchr/testify/assert and
require; replace those imports with the local test helpers (from
test_asserts_test.go) and migrate usages: change require.NoError(t, err) to
assertNoError(t, err), require.Error(t, ...) to assertError(t, ...), and
assert.Equal(t, want, got) to assertEqual(t, want, got); ensure any other
testify calls are mapped to the corresponding local helpers and remove the
testify import lines so the file uses only the local assertion helpers.

In `@journal_test.go`:
- Line 190: The commitJournalEntry invocation has odd internal spacing/line
breaks; normalize the call to standard formatting by removing the extra spaces
and trailing comma inside the argument list so the call reads as a normal
single-line (or consistently indented multi-line) function call—locate the call
to commitJournalEntry (using storeInstance.sqliteDatabase, "events",
"session-b", `{"like":2}`, `{"workspace":"session-b"}`, time.Now().UnixMilli())
and adjust whitespace/commas so arguments are cleanly separated and consistent
with project formatting.

In `@json.go`:
- Line 47: Rename the terse parameter v to a descriptive name like value in the
exported function MarshalIndent to improve API clarity; update the function
signature (func MarshalIndent(value any, prefix, indent string) ([]byte, error))
and every internal and external call site that references MarshalIndent, plus
any associated docs/comments and tests that mention the old parameter name,
ensuring compilation and behavior remain unchanged.
- Around line 14-18: Replace the placeholder-style example values in the public
comments with concrete, copy-pastable snippets: for the CacheEntry example (type
CacheEntry struct { Data store.RawMessage `json:"data"` }), and the other
occurrences noted, use real JSON byte literals and realistic variable names
(e.g., jsonData := []byte(`{"name":"Alice"}`), cacheEntry := CacheEntry{Data:
jsonData}, compactJSON := []byte(`{"name":"Alice"}`)) so the examples compile
and demonstrate actual usage of store.RawMessage and CacheEntry; update all
instances called out (the examples at the five locations) to follow this
concrete pattern.

In `@medium.go`:
- Around line 234-262: The variable wasEscaped in splitCSVLine is unused/dead
code; remove its declaration and all assignments to it (the assignments inside
the escaped-quote case and after appending a field) and delete the final "_ =
wasEscaped" discard line so the function only uses buffer, fields and inQuotes,
preserving current CSV parsing behavior.

In `@test_asserts_test.go`:
- Around line 210-222: The errIs helper currently only follows single-error
wrappers via the Unwrap() error pattern; update errIs to also detect and handle
multi-error unwrapping (interface Unwrap() []error) introduced in Go1.20 so it
can traverse errors created by errors.Join. Specifically, in errIs check for the
multi-unwrapper type (Unwrap() []error) and recursively inspect each error in
the returned slice (falling back to the existing single Unwrap() error path)
until a match with the target is found; keep the function name errIs and
preserve the existing single-error logic while adding the multi-error branch.

In `@tests/cli/store/Taskfile.yaml`:
- Around line 3-5: The Taskfile currently defines a default task with deps
[build, test] but lacks a vet step; add a new task named "vet" that runs the
static analysis command (e.g., "go vet ./...") and include "vet" as a dependency
in the "default" task's deps array so the default workflow runs static analysis
automatically; locate the "tasks" block and modify the "default" task and add a
new "vet" task entry accordingly.

In `@workspace.go`:
- Around line 42-47: The Workspace struct currently has two fields db and
sqliteDatabase that represent the same SQL handle; collapse them into a single
canonical field (choose one name, e.g., db) by removing sqliteDatabase from the
struct and updating all references and assignments (including in ensureReady,
constructors, Close/cleanup methods, and any methods that read/write
sqliteDatabase) to use the single canonical field; remove any compatibility
aliases or duplicate maintenance logic and ensure ensureReady still reconciles
any drift but only writes to the canonical field (Workspace.db) and that all
call sites (methods, tests, initializers) are updated to the chosen field name.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 302a3209-afdd-4543-aa08-088dfc740216

📥 Commits

Reviewing files that changed from the base of the PR and between 14cfae9 and 85ab185.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (37)
  • CLAUDE.md
  • README.md
  • compact.go
  • compact_test.go
  • conventions_test.go
  • coverage_test.go
  • doc.go
  • docs/RFC-STORE.md
  • docs/architecture.md
  • docs/index.md
  • duckdb.go
  • events.go
  • events_test.go
  • go.mod
  • import.go
  • import_export_test.go
  • inventory.go
  • journal.go
  • journal_test.go
  • json.go
  • medium.go
  • medium_test.go
  • parquet.go
  • path_test.go
  • publish.go
  • recover_test.go
  • scope.go
  • scope_test.go
  • store.go
  • store_test.go
  • test_asserts_test.go
  • test_helpers_test.go
  • tests/cli/store/Taskfile.yaml
  • transaction.go
  • transaction_test.go
  • workspace.go
  • workspace_test.go

@Snider Snider changed the title chore(dev→main): routine PR cadence for CodeRabbit review feat: add store workspace import and publish workflows Apr 27, 2026
@Snider
Contributor Author

Snider commented Apr 27, 2026

RESOLVED-COMMENT: CodeRabbit pre-merge docstring/title/config checks re-verified locally. The PR title is now feat: add store workspace import and publish workflows. For doc comments, the local AX-2 convention tests enforce Usage example: comments on all exported declarations and exported struct fields, and GOWORK=off go test -count=1 -run TestConventions_Exports_Good ./... passes. .golangci.yml is migrated to v2 format and golangci-lint config verify plus golangci-lint run ./... pass locally.

30+ findings dispositioned across compact / publish / events / import
/ workspace / config / docs / test surfaces.

Major code fixes:
- compact.go: stage archive, DB commit first, publish after commit
  (durability ordering bug)
- compact_test.go: replaced medium reread O(n²) with bytes.Buffer
- coverage_test.go: EnsureScoringTables now wraps + returns DDL errors
- events.go: lifecycle lock-order reordered (deadlock risk)
- import.go: paths join against cfg.DataDir; SQL write errors propagate;
  counts increment only on success
- bench_test.go: ScpDir + /benchmarks/ paths corrected
- json.go: formatter write errors propagate (no silent drops)
- parquet.go: removed runtime parquet-go dep + in-core writer (was
  buffering whole rows in memory — OOM risk on large datasets)
- publish.go: PublishConfig.Public uses private: !public
- HuggingFace upload streams file with content length (was buffering)
- store.go: ScopedStore.Transaction nil-store panic guard
- recover_test.go: handle .duckdb.wal sidecars; skip SQLite -shm for DuckDB
- coverage_test.go: WriteScoringResult wraps insert failures
- events_test.go: restored EventDeleteGroup wire value 'delete_group'
- transaction.go: NewScopedConfigured docs include parent store arg
- store_test.go: scope_test keyName uses non-wrapping integer names
- import_export_test.go: testify → stdlib helpers (AX-6 conformance)
- store.go: collapsed duplicate workspace DB fields to canonical 'db'

Doc / config:
- README.md: 'Licence' UK English on badge
- docs/architecture.md: clearer event ordering + lifecycle docs
- .golangci.yml: migrated to golangci-lint v2 schema
- Taskfile: default includes vet task
- JSON: terse 'value' param + concrete examples

Disposition replies (RESOLVED-COMMENT, no code change):
- conventions_test.go testify suggestion: AX-6 banned testify; stdlib helpers are convention
- DuckDB CGO/MIT critical: retained as documented exception in
  DEPENDENCIES.md (load-bearing existing dependency for the workspace
  store; replacement is its own engineering ticket)

Verification: GOWORK=off go vet + go test -count=1 ./... pass.
golangci-lint run ./... reports 0 issues. gofmt -l clean. git diff
--check clean.

Closes findings on #4

Co-authored-by: Codex <noreply@openai.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
workspace.go (1)

321-329: ⚠️ Potential issue | 🟠 Major

Commit() becomes non-idempotent once the store write succeeds.

If commitWorkspaceAggregate() succeeds and closeAndRemoveFiles() then fails, the caller gets OK: false even though the journal row and workspace:NAME/summary entry are already durable. Because RecoverOrphans() will still reopen the leftover .duckdb file, a retry can publish the same aggregate again.

This needs a committed sentinel or a recovery filter so cleanup failures do not turn into duplicate commits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workspace.go` around lines 321 - 329, Commit() can report failure after the
durable store write (commitWorkspaceAggregate) succeeds because
closeAndRemoveFiles() failed, making commits non-idempotent and allowing
duplicate aggregates on retry; fix by making the commit atomic from the caller’s
POV: have commitWorkspaceAggregate (or an adjacent store call) write a durable
"committed" sentinel/marker for workspace.name before attempting
closeAndRemoveFiles(), and treat closeAndRemoveFiles() failures as cleanup-only
(return OK:true with a logged cleanup error) so callers don’t retry the
aggregate; alternatively ensure RecoverOrphans() checks that sentinel and skips
re-publishing aggregates for workspaces already marked committed; update
Workspace.Commit(), commitWorkspaceAggregate, closeAndRemoveFiles(), and
RecoverOrphans() accordingly so a successful store write is permanently
considered committed even if cleanup fails.
compact.go (1)

188-223: ⚠️ Potential issue | 🔴 Critical

Don’t delete the only staged archive after the DB commit succeeds.

Once transaction.Commit() has succeeded, a later medium.Rename(...) failure is no longer recoverable from the database side. The deferred cleanup then deletes *.tmp, so the journal rows stay marked as archived while the archive payload is lost and won’t be picked up by the next compaction run.

This path needs to preserve the staged object after the point-of-no-return, or otherwise compensate for a failed publish before returning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compact.go` around lines 188 - 223, The deferred cleanup currently can delete
the only staged archive after transaction.Commit() succeeds, losing the payload
if medium.Rename(...) then fails; to fix, mark the staged artifact as preserved
before the point-of-no-return so the defer won’t delete it: set
stagedOutputPublished = true immediately after a successful transaction.Commit()
(and before calling medium.Rename), so on a subsequent rename failure the staged
file is left in place for recovery/retry and you return the error; keep the
existing rollback/defer logic for transaction and only delete stagedOutputPath
when stagedOutputPublished is false.
🧹 Nitpick comments (1)
medium.go (1)

18-18: Use a StoreConfig literal in this public usage example.

Line 18 shows an option chain even though the configuration values are already known; align the example with the configured-constructor pattern used elsewhere in this file.

As per coding guidelines: "Prefer StoreConfig and ScopedStoreConfig literals over option chains when the configuration is already known".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@medium.go` at line 18, Replace the option-chain usage in the public example
with a StoreConfig literal: instead of calling store.New(":memory:",
store.WithMedium(medium)), construct a StoreConfig (or ScopedStoreConfig if
relevant) with the Medium field set to the local medium instance and pass that
to store.New (e.g., create storeCfg := store.StoreConfig{Medium: medium} and
call store.New(":memory:", storeCfg) or the configured-constructor variant used
elsewhere); update the example line that currently references local.New and
store.New to demonstrate creating medium via local.New and then passing a
StoreConfig literal to store.New.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@duckdb.go`:
- Around line 8-10: The duckdb.go file imports "github.com/marcboeker/go-duckdb"
and causes typecheck failures when CGO is disabled (undefined: Conn); fix by
either adding a CGO-only build tag to the file (add a top-of-file line
//go:build cgo and the legacy // +build cgo) so the file is excluded when CGO
isn't enabled, and update DEPENDENCIES.md to document that duckdb.go requires
CGO; or enable CGO in CI by setting CGO_ENABLED=1 in the build action (ensure
the reused action dAppCore/build/actions/build/core@dev receives CGO_ENABLED=1)
so the package builds with duckdb.go present. Ensure you reference duckdb.go and
the import of "github.com/marcboeker/go-duckdb" when making the change.

In `@import.go`:
- Around line 311-323: ImportAll currently scans hardcoded paths
(/tmp/lem-data/seeds, /tmp/lem-repo/seeds) which makes seed discovery
non-deterministic; change the seed discovery in the ImportAll block that builds
seedDirs so it only uses the configured data dir (core.JoinPath(cfg.DataDir,
"seeds")) or explicit ImportConfig values instead of the two /tmp paths, and
keep calling importSeeds(db, seedDir) and accumulating seedTotal as before
(update the seedDirs variable and any ordering so discovery is deterministic);
refer to symbols ImportAll, seedDirs, cfg.DataDir, importSeeds, and
totals["seeds"] to locate and modify the code.
- Around line 479-493: The code currently swallows read/parse failures by
returning early from the import loop (using readResult and core.JSONUnmarshal)
which makes partial imports appear successful; change the handling so read and
JSON parse errors for the file at path (use readResult, data, core.JSONUnmarshal
on raw) are surfaced to the caller (or the import summary) instead of silently
returning—e.g., convert the early returns into error returns or set a failure
flag and include the file path (rel/region) and the underlying error message in
the returned error/summary so permissions or malformed JSON are reported.

In `@medium.go`:
- Around line 160-204: The current importJSON/collectJSONRecords silently drop
unsupported JSON shapes or non-object array items; change the logic to fail fast
and return an error when encountering unsupported top-level shapes or non-object
array entries: update collectJSONRecords to report errors (e.g., return
([]map[string]any, error)) and detect these cases—unsupported top-level types,
arrays with non-map items, or unknown wrapper keys—then have importJSON
propagate that error instead of proceeding, so workspace.Put is only called when
valid records are returned.
- Around line 206-260: The current importCSV and splitCSVLine implementations
break on quoted multiline fields and silently ignore malformed CSV; replace the
hand-rolled parser by using encoding/csv: create a csv.Reader over the full
content (e.g., csv.NewReader(strings.NewReader(content))), set
Reader.FieldsPerRecord = -1 to allow variable columns, call Read to get the
header, then iterate with Read (or ReadAll) to get each record, map
header->field into the same record map and call workspace.Put(kind, record), and
return any csv.Reader errors (wrap with core.E like other errors) so unclosed
quotes or malformed CSV surface; remove or stop using splitCSVLine.
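The `encoding/csv` replacement sketched below follows the comment's recipe: a `csv.Reader` over the content with `FieldsPerRecord = -1`, header read first, then per-record mapping into the importer's record shape. The `put` callback stands in for `workspace.Put(kind, record)`:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// importCSVRecords parses CSV with the standard library so quoted
// multiline fields work and malformed input surfaces as an error.
func importCSVRecords(content string, put func(record map[string]any) error) error {
	reader := csv.NewReader(strings.NewReader(content))
	reader.FieldsPerRecord = -1 // allow rows with differing column counts

	header, err := reader.Read()
	if err != nil {
		return fmt.Errorf("store.importCSV: read header: %w", err)
	}
	for {
		row, err := reader.Read()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return fmt.Errorf("store.importCSV: malformed CSV: %w", err)
		}
		record := make(map[string]any, len(header))
		for i, field := range row {
			if i < len(header) {
				record[header[i]] = field
			}
		}
		if err := put(record); err != nil {
			return err
		}
	}
}

func main() {
	content := "name,note\nalice,\"line one\nline two\"\n"
	err := importCSVRecords(content, func(record map[string]any) error {
		fmt.Println(record["name"])
		return nil
	})
	fmt.Println("err:", err)
}
```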

In `@publish.go`:
- Around line 73-80: The preflight allows an empty cfg.Repo during dry-run which
later causes ensureHFDatasetRepo to fail on real publish; validate cfg.Repo
earlier in Publish regardless of cfg.DryRun by checking cfg.Repo != "" and
returning an error via core.E (same pattern used for InputDir/token) if empty,
so the dry-run will fail fast and mirror real publish behavior; update the
validation near the top of Publish to reference cfg.Repo and ensureHFDatasetRepo
expectations.
- Around line 244-265: The uploadFileToHF function currently hardcodes
http.Client{Timeout: 120 * time.Second} which caps the whole transfer and can
abort large streamed uploads; change uploadFileToHF to accept a caller-supplied
context.Context (or an explicit timeout/dialer config) and create the request
with http.NewRequestWithContext(ctx, ...) and use an http.Client that does not
set a fixed Timeout (or reads a configurable timeout value) so the caller
controls cancellation; update call sites to pass context and ensure req
creation, client variable and any timeout configuration reference the new
ctx/timeout parameter instead of the fixed 120s value.
- Around line 5-13: The token fallback currently uses a nonstandard DIR_HOME
variable so the documented "~/.huggingface/token" is never read; update the
token discovery logic (where DIR_HOME is referenced and where Token and HF_TOKEN
fallbacks are handled) to resolve the real user home directory via
os.UserHomeDir() (with os.Getenv("HOME") as a secondary fallback) and construct
the path with filepath.Join(home, ".huggingface", "token"), then attempt to read
that file as the final fallback; replace any usages of DIR_HOME and adjust the
code in the token lookup block (also present around the other occurrence at the
same file region) to use this real-home resolution and file read.

In `@README.md`:
- Line 2: Update the README badge text from "License" to "Licence" and change
the link target from "LICENSE.md" to "LICENCE.md" so both label and link use UK
English; locate the markdown badge line (the [![Licence: EUPL-1.2](...)](...)
string) and update the visible label and URL, and if the repository file is
currently named LICENSE.md also rename that file to LICENCE.md (or add a
symlink) so the link resolves.

In `@store.go`:
- Around line 466-478: The Close path returns early when
storeInstance.sqliteDatabase is nil, skipping closing and sync; mirror the
db/sqliteDatabase backfill used by ensureReady by checking storeInstance.db when
sqliteDatabase is nil and assigning/using it so the connection will be closed
and syncMediumBackedDatabase will run; update the Close function (references:
storeInstance, sqliteDatabase, db, Close, syncMediumBackedDatabase,
orphanCleanupErr) to use the backfilled sqliteDatabase or the db equivalent
instead of returning early, then proceed with Close, syncMediumBackedDatabase,
and the orphan cleanup error handling as before.
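The backfill pattern can be sketched with simplified stand-in types (the real `store`/`conn` shapes and the medium-sync call differ in the repository):

```go
package main

import "fmt"

// conn stands in for the real SQLite connection.
type conn struct{ closed bool }

// store mirrors, in simplified form, the aliasing the review describes:
// db and sqliteDatabase may point at the same connection, and Close must
// backfill one from the other instead of returning early on nil.
type store struct {
	db             *conn
	sqliteDatabase *conn
}

func (s *store) Close() error {
	if s.sqliteDatabase == nil {
		// Backfill from the db alias, mirroring ensureReady, so the
		// connection is still closed and the medium sync still runs.
		s.sqliteDatabase = s.db
	}
	if s.sqliteDatabase == nil {
		return nil // genuinely nothing to close
	}
	s.sqliteDatabase.closed = true
	return nil
}

func main() {
	c := &conn{}
	s := &store{db: c} // sqliteDatabase never populated
	_ = s.Close()
	fmt.Println(c.closed) // the backfill ensures the close happens
}
```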

In `@test_asserts_test.go`:
- Around line 189-209: The errIs helper currently only does == and Unwrap walks
and misses custom Is(error) bool logic and the nil==nil case; update the errIs
function to delegate to the standard library by using errors.Is(err, target) so
custom Is methods and the nil handling are respected, and add the "errors"
import if missing; replace the existing manual loop in errIs with this single
call (keeping the function name errIs) to fix matching behavior.

---

Outside diff comments:
In `@compact.go`:
- Around line 188-223: The deferred cleanup currently can delete the only staged
archive after transaction.Commit() succeeds, losing the payload if
medium.Rename(...) then fails; to fix, mark the staged artifact as preserved
before the point-of-no-return so the defer won’t delete it: set
stagedOutputPublished = true immediately after a successful transaction.Commit()
(and before calling medium.Rename), so on a subsequent rename failure the staged
file is left in place for recovery/retry and you return the error; keep the
existing rollback/defer logic for transaction and only delete stagedOutputPath
when stagedOutputPublished is false.

In `@workspace.go`:
- Around line 321-329: Commit() can report failure after the durable store write
(commitWorkspaceAggregate) succeeds because closeAndRemoveFiles() failed, making
commits non-idempotent and allowing duplicate aggregates on retry; fix by making
the commit atomic from the caller’s POV: have commitWorkspaceAggregate (or an
adjacent store call) write a durable "committed" sentinel/marker for
workspace.name before attempting closeAndRemoveFiles(), and treat
closeAndRemoveFiles() failures as cleanup-only (return OK:true with a logged
cleanup error) so callers don’t retry the aggregate; alternatively ensure
RecoverOrphans() checks that sentinel and skips re-publishing aggregates for
workspaces already marked committed; update Workspace.Commit(),
commitWorkspaceAggregate, closeAndRemoveFiles(), and RecoverOrphans()
accordingly so a successful store write is permanently considered committed even
if cleanup fails.

---

Nitpick comments:
In `@medium.go`:
- Line 18: Replace the option-chain usage in the public example with a
StoreConfig literal: instead of calling store.New(":memory:",
store.WithMedium(medium)), construct a StoreConfig (or ScopedStoreConfig if
relevant) with the Medium field set to the local medium instance and pass that
to store.New (e.g., create storeCfg := store.StoreConfig{Medium: medium} and
call store.New(":memory:", storeCfg) or the configured-constructor variant used
elsewhere); update the example line that currently references local.New and
store.New to demonstrate creating medium via local.New and then passing a
StoreConfig literal to store.New.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6afd44c2-2016-4121-ad05-a1c25984b54d

📥 Commits

Reviewing files that changed from the base of the PR and between 85ab185 and 6c90af8.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (35)
  • .golangci.yml
  • DEPENDENCIES.md
  • README.md
  • bench_test.go
  • compact.go
  • compact_test.go
  • conventions_test.go
  • coverage_test.go
  • doc.go
  • docs/architecture.md
  • duckdb.go
  • events.go
  • events_test.go
  • go.mod
  • import.go
  • import_export_test.go
  • journal.go
  • journal_test.go
  • json.go
  • medium.go
  • medium_test.go
  • parquet.go
  • path_test.go
  • publish.go
  • recover_test.go
  • scope.go
  • scope_test.go
  • store.go
  • store_test.go
  • test_asserts_test.go
  • tests/cli/store/Taskfile.yaml
  • transaction.go
  • transaction_test.go
  • workspace.go
  • workspace_test.go
✅ Files skipped from review due to trivial changes (4)
  • DEPENDENCIES.md
  • docs/architecture.md
  • tests/cli/store/Taskfile.yaml
  • bench_test.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • path_test.go
  • doc.go
  • medium_test.go
  • journal_test.go
  • events.go
  • events_test.go
  • scope.go
  • coverage_test.go

Comment thread duckdb.go
Comment thread import.go
Comment thread import.go
Comment thread medium.go
Comment thread medium.go Outdated
Comment thread publish.go
Comment thread publish.go Outdated
Comment thread README.md Outdated
Comment thread store.go
Comment thread test_asserts_test.go Outdated
Round 2 follow-up to 6c90af8. CodeRabbit re-reviewed and surfaced
14 residual issues; all dispositioned.

Code:
- compact.go: staged archive preserved after successful DB commit
  (was being deleted prematurely)
- workspace.go: commit idempotency — recovery skips/removes leftover
  files when durable summary marker exists; cleanup failure no longer
  fails Commit() after durable write
- medium.go: StoreConfig public example; JSON import fails fast on
unsupported/non-object records; CSV parser switched from a hand-rolled
parser to encoding/csv with multiline + malformed handling
- import.go: removed /tmp seed fallbacks (deterministic dirs); read +
  JSON parse failures now return contextual errors
- publish.go: HuggingFace token uses real HOME via core.Env (not
  DIR_HOME); empty Repo validated before dry-run; upload uses
  caller-configurable PublishConfig.Context (no fixed http timeout)
- store.go: Close() backfills db/sqliteDatabase aliases before
  closing/syncing
- test_asserts_test.go: errIs delegates to core.Is (AX import rules)

CI / docs:
- .github/workflows/ci.yml: CGO_ENABLED=1 explicit (DuckDB requires CGO)
- DEPENDENCIES.md: required toolchain documented for DuckDB context
- README.md: Licence badge UK English + LICENCE.md link
- LICENCE.md (new file)
- publish_test.go (new) — covers HOME / dry-run / config-context paths

Disposition replies:
- Testify reintroduction suggestion: RESOLVED-COMMENT — AX-6 bans testify
- SonarCloud: no PR comments/check annotations exposed; RESOLVED-COMMENT

Verification: gofmt clean, golangci-lint run 0 issues, GOWORK=off
go vet + go test -count=1 ./... pass with explicit cache paths.

Closes residual findings on #4

Co-authored-by: Codex <noreply@openai.com>
@github-advanced-security

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

95-97: ⚠️ Potential issue | 🟡 Minor

Fix the footer licence link target.

Line 97 links to LICENCE, but the README now points to LICENCE.md elsewhere. If the file is LICENCE.md, this footer link is broken for readers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 95 - 97, The README footer currently links to
"LICENCE" which is broken; update the link target in the "## Licence" section so
it points to "LICENCE.md" (replace the string 'LICENCE' with 'LICENCE.md') to
match the repository file name and restore the correct link.
♻️ Duplicate comments (1)
publish.go (1)

144-161: ⚠️ Potential issue | 🟠 Major

~/.huggingface/token is still tied to HOME only.

Line 153 still assumes HOME is set, so the documented file fallback can disappear on perfectly normal environments where the home directory exists but HOME does not. Prefer os.UserHomeDir() here and keep HOME only as a fallback.

In Go, is relying only on the HOME environment variable portable for locating the current user's home directory, or should os.UserHomeDir() be used instead?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@publish.go` around lines 144 - 161, The resolveHFToken function currently
uses HOME env only to read ~/.huggingface/token; replace that logic to call
os.UserHomeDir() to get the user's home directory and fall back to
core.Env("HOME") only if UserHomeDir returns an error or empty, then continue
reading localFs.Read(core.JoinPath(home, ".huggingface", "token")) as before;
update imports to include os (if needed) and ensure behavior when both methods
fail remains returning "".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@import.go`:
- Around line 358-360: The code currently swallows JSON unmarshal failures
(e.g., the block using core.JSONUnmarshal(scanner.Bytes(), &rec) that does "if r
:= ...; !r.OK { continue }"), causing malformed JSONL lines to be ignored;
change this to record and surface parse failures: increment a local failure
counter, capture the scanner.Bytes()/line number and wrap the original parse
error with file and line context, log or append that wrapped error to an errors
slice, and after scanning return a non-nil error (or aggregate error) if any
failures occurred; apply the same fix to the other similar sites you mentioned
(the blocks around lines 404-406 and 442-444) so malformed records are not
silently dropped and the caller receives meaningful failure information.
- Around line 79-339: ImportAll currently drops and recreates tables
incrementally which can leave the DB half-imported on error; wrap the full
import in a single transactional boundary (start a transaction via db.Begin or
BeginTx at the start of ImportAll, use tx.Exec/tx.QueryRowScan equivalents for
all table DDL/DML operations performed in ImportAll including creation of
golden_set, training_examples, benchmark_results, benchmark_questions, seeds and
all import* helper calls, and ensure any error triggers tx.Rollback and the
function returns, while successful completion calls tx.Commit); alternatively
implement a temp-table swap strategy where you create temp tables (e.g.
golden_set_tmp, training_examples_tmp, etc.) during import and only DROP/RENAME
them into place after all imports succeed — reference ImportAll, the db.Exec
calls that CREATE/DROP tables, and the import* helpers (importTrainingFile,
importBenchmarkFile, importBenchmarkQuestions, importSeeds) to locate all places
needing to use the transaction or temp tables.
- Around line 545-563: walkDir currently swallows failures from localFs.List and
bad type assertions, so unreadable seed subdirectories are ignored; change
walkDir to return an error (e.g., func walkDir(root string, fn func(path string)
error) error), have it check r.OK and return a wrapped error when localFs.List
fails, return an error when the type assertion on r.Value fails, propagate
errors from recursive walkDir(full, fn) and from fn(full) back up, and update
importSeeds to call the new walkDir signature and return the propagated error
instead of reporting success on traversal failures.

In `@medium.go`:
- Around line 245-251: The JSON export currently calls workspace.Aggregate(),
which masks workspace failures by returning {}; replace that with a call that
surfaces errors (e.g., workspace.aggregateFields() or a readiness check) and
propagate any error instead of proceeding: call workspace.aggregateFields() (or
workspace.Ready()/Aggregate with an explicit error-check) from exportJSON,
return core.E(...) when that call returns an error, and only JSONMarshalString
and medium.Write when the aggregate call succeeds so exportJSON fails rather
than silently writing an empty object.

In `@publish.go`:
- Around line 100-106: collectUploadFiles currently includes the optional
dataset_card.md so the existing check against files (the variable returned by
collectUploadFiles) can be true even when no Parquet splits exist; change the
logic to detect Parquet splits explicitly: update collectUploadFiles (or add a
new helper like collectParquetSplits) to return whether any *.parquet split
files were found (e.g., return a separate splits slice or a bool foundParquet
alongside files), then replace the current "len(files) == 0" checks in
publish.Publish (and the other similar check around the later publish flow) with
a check that fails unless at least one Parquet split was found, leaving
dataset_card.md allowed but not counted as a split.
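A sketch of the suggested helper; `collectParquetSplits` follows the name proposed in the comment and is not necessarily what the repository uses:

```go
package main

import (
	"fmt"
	"strings"
)

// collectParquetSplits separates actual *.parquet splits from other
// upload files, so the "nothing to publish" check cannot be satisfied
// by an optional dataset_card.md alone.
func collectParquetSplits(files []string) (splits []string, foundParquet bool) {
	for _, file := range files {
		if strings.HasSuffix(file, ".parquet") {
			splits = append(splits, file)
		}
	}
	return splits, len(splits) > 0
}

func main() {
	_, ok := collectParquetSplits([]string{"dataset_card.md"})
	fmt.Println(ok) // false: a card alone is not publishable
	splits, ok := collectParquetSplits([]string{"dataset_card.md", "train.parquet"})
	fmt.Println(ok, len(splits)) // true 1
}
```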

In `@store.go`:
- Around line 967-985: The purge currently snapshots expiredEntries via
listExpiredEntriesMatchingGroupPrefix then separately calls
purgeExpiredMatchingGroupPrefix, risking races; change PurgeExpired to perform
the select+delete atomically and derive notifications from the rows actually
deleted — either by wrapping the select and delete in a single transaction (use
the same sqliteDatabase tx and call listExpiredEntriesMatchingGroupPrefix and
purgeExpiredMatchingGroupPrefix against that tx) or by merging into a DELETE ...
RETURNING (preferred) that returns group/key rows and using that result to drive
storeInstance.notify(EventDelete) so notifications exactly match deleted rows
(update references to listExpiredEntriesMatchingGroupPrefix,
purgeExpiredMatchingGroupPrefix, PurgeExpired, and storeInstance.notify
accordingly).
- Around line 426-432: The Close() flow currently sets storeInstance.isClosed
under lifecycleLock before calling syncMediumBackedDatabase(), which prevents
retries if the sync fails; change the logic in Close() (and the similar block
used later) to only set storeInstance.isClosed after syncMediumBackedDatabase()
completes successfully, or alternatively introduce a distinct state (e.g.,
isClosing or syncPending) that remains set when syncMediumBackedDatabase() fails
so subsequent Close() calls will retry the sync; ensure lifecycleLock still
protects checks/updates to isClosed/isClosing and call
syncMediumBackedDatabase() outside the lock or while holding a closing flag to
avoid races.

---

Outside diff comments:
In `@README.md`:
- Around line 95-97: The README footer currently links to "LICENCE" which is
broken; update the link target in the "## Licence" section so it points to
"LICENCE.md" (replace the string 'LICENCE' with 'LICENCE.md') to match the
repository file name and restore the correct link.

---

Duplicate comments:
In `@publish.go`:
- Around line 144-161: The resolveHFToken function currently uses HOME env only
to read ~/.huggingface/token; replace that logic to call os.UserHomeDir() to get
the user's home directory and fall back to core.Env("HOME") only if UserHomeDir
returns an error or empty, then continue reading
localFs.Read(core.JoinPath(home, ".huggingface", "token")) as before; update
imports to include os (if needed) and ensure behavior when both methods fail
remains returning "".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73015361-e169-4081-8b52-dd74e57b9a84

📥 Commits

Reviewing files that changed from the base of the PR and between 6c90af8 and ebe5377.

📒 Files selected for processing (15)
  • .github/workflows/ci.yml
  • DEPENDENCIES.md
  • LICENCE.md
  • README.md
  • compact.go
  • import.go
  • medium.go
  • medium_test.go
  • publish.go
  • publish_test.go
  • store.go
  • store_test.go
  • test_asserts_test.go
  • workspace.go
  • workspace_test.go
✅ Files skipped from review due to trivial changes (3)
  • LICENCE.md
  • .github/workflows/ci.yml
  • DEPENDENCIES.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • store_test.go

Comment thread import.go
Comment thread import.go
Comment thread import.go
Comment thread medium.go
Comment thread publish.go Outdated
Comment thread store.go
Comment thread store.go
… order on PR #4

Round 3 follow-up to ebe5377. Closes residual CodeRabbit findings.

Code:
- import.go: ImportAll DB mutations wrapped in transaction with
  rollback-on-error
- import.go: malformed JSONL returns file/line parse errors in all
  three import helpers (was silently swallowing per-line errors)
- import.go: walkDir returns + propagates traversal/list/type errors
- medium.go: JSON export uses aggregateFields() + propagates
  workspace failures
- publish.go: dataset_card.md excluded from Parquet split count
- store.go: medium-backed Close() remains retryable after sync
  failure; operations see closing state as closed
- store.go + scope.go + transaction.go: purge uses
  DELETE ... RETURNING so notifications come from rows actually
  deleted (was reading first then deleting separately)
- publish.go: token lookup uses Core's DIR_HOME (populated via
  os.UserHomeDir) then falls back to HOME — preserves direct-os
  import ban while picking up real home

Tests:
- import_test.go (new): coverage of transactional import +
  malformed-JSONL error path

Doc:
- README.md: footer licence link targets LICENCE.md (UK English)

Verification: gofmt clean, golangci-lint v2 0 issues, GOWORK=off
go vet + go test -count=1 ./... pass with explicit cache paths.

Closes residual findings on #4

Co-authored-by: Codex <noreply@openai.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
import.go (1)

165-183: Prefer descriptive loop variable names over abbreviations.

Names like td, rel, jf, and bfile reduce readability in an already large orchestration function. Consider expanding these to self-describing identifiers.

As per coding guidelines: **/*.go: Prefer descriptive names over abbreviations.

Also applies to: 189-191, 277-283, 291-301

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@import.go` around lines 165 - 183, The loops iterating over trainingDirs and
other slices use short, unclear variable names (e.g., td, rel, jf, bfile) which
hurt readability; update those loop variable names in import.go to
self-describing identifiers (for example: trainingDir, relatedPath, jsonFile,
baseFile or similar) wherever they appear around the trainingDirs block and the
other referenced loops so the purpose of each variable is explicit (search for
uses of trainingDirs and the loop blocks later in the file to find all
occurrences to rename).
store_test.go (1)

417-497: Stop swallowing fixture-setup errors in these new existence tests.

Several of the added cases still do storeInstance, _ := New(...) and ignore Set/DeleteGroup failures. If the fixture creation breaks, these tests can panic or assert on the wrong behaviour instead of failing at the setup step. Please keep the new coverage on the same assertNoError path already used elsewhere in this file.

Suggested pattern
 func TestStore_Exists_Good_Present(t *testing.T) {
-	storeInstance, _ := New(":memory:")
+	storeInstance, err := New(":memory:")
+	assertNoError(t, err)
 	defer func() { _ = storeInstance.Close() }()
 
-	_ = storeInstance.Set("config", "colour", "blue")
+	assertNoError(t, storeInstance.Set("config", "colour", "blue"))
 
 	exists, err := storeInstance.Exists("config", "colour")
 	assertNoError(t, err)
 	assertTrue(t, exists)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store_test.go` around lines 417 - 497, The tests are ignoring setup errors
(using storeInstance, _ := New(...), and discarding errors from
Set/SetWithTTL/DeleteGroup); update each test (e.g.,
TestStore_Exists_Good_Present, TestStore_Exists_Good_Absent,
TestStore_Exists_Good_ExpiredKeyReturnsFalse, TestStore_GroupExists_* and
similar) to capture the returned error from New and from
Set/SetWithTTL/DeleteGroup, call assertNoError(t, err) immediately after each
setup call, and only proceed if setup succeeded; reference the New, Set,
SetWithTTL, DeleteGroup and Close calls to locate where to add the error
variables and assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@import.go`:
- Around line 245-246: The call sites that ignore localFs.EnsureDir results
(e.g., the localFs.EnsureDir(benchLocal) invocations) must check and handle the
returned error instead of discarding it; update each EnsureDir call to capture
the error, log or wrap it with context (including the target path like
benchLocal), and return the error (or propagate it up) so SCP/import fails
visibly on directory creation failure rather than continuing silently and
producing misleading totals.

In `@publish_test.go`:
- Around line 31-40: The test TestPublish_ResolveHFToken_Good_UserHomeFallback
is flaky because resolveHFToken prefers DIR_HOME over HOME; ensure the test
clears DIR_HOME so the HOME fallback is exercised by adding an environment
unset/clear for DIR_HOME (e.g., use t.Setenv("DIR_HOME", "") or otherwise clear
DIR_HOME at the start of the test) before setting HOME and calling
resolveHFToken; this change should be applied inside that test function so
resolveHFToken reads from HOME's .huggingface/token as intended.

In `@scope.go`:
- Around line 181-184: The readiness check is using the generic store-level
method name; update the calls to ensureReady to use scoped API method names so
errors reference the actual failing wrapper. Replace ensureReady("store.Exists")
with ensureReady("store.Scoped.Exists") (and similarly use
"store.Scoped.GroupExists" for the GroupExists wrapper and the analogous names
for the other scoped wrappers at the mentioned locations) in the scopedStore
methods (scopedStore.Exists, scopedStore.GroupExists, etc.) so the readiness
error string identifies the scoped API that failed.
- Around line 157-160: ScopedStore.Namespace() should guard against a nil
receiver to avoid panics when NewScoped() returned nil; update the Namespace
method (on type ScopedStore) to check if scopedStore == nil and return the
zero-value (empty string) in that case, mirroring the behavior of the other
public accessors, so callers can safely call Namespace() without nil checks even
when NewScoped() returned nil.
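The nil-receiver guard from the second bullet, sketched on a pared-down stand-in for the reviewed type:

```go
package main

import "fmt"

// ScopedStore here carries only the field this sketch needs.
type ScopedStore struct {
	namespace string
}

// Namespace guards against a nil receiver so callers of a failed
// NewScoped() get the zero value instead of a panic.
func (scopedStore *ScopedStore) Namespace() string {
	if scopedStore == nil {
		return ""
	}
	return scopedStore.namespace
}

func main() {
	var missing *ScopedStore // models NewScoped() having returned nil
	fmt.Println(missing.Namespace() == "") // true: safe on nil
	populated := &ScopedStore{namespace: "tenant-a"}
	fmt.Println(populated.Namespace()) // tenant-a
}
```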

In `@store_test.go`:
- Around line 1639-1647: The direct receive from the events channel can hang the
test; change the blocking "event := <-events" to a bounded select that waits for
the purge event from events or times out (e.g., use time.After with a short
deadline) and fail the test on timeout, keeping the existing assertions on
event.Type/Group/Key; leave the subsequent non-blocking select that checks for
unexpected extra events as-is. Target the test usage of the events channel in
store_test.go (the spot that calls PurgeExpired() and reads from events) and
ensure the test imports/uses time for the timeout.

---

Nitpick comments:
In `@import.go`:
- Around line 165-183: The loops iterating over trainingDirs and other slices
use short, unclear variable names (e.g., td, rel, jf, bfile) which hurt
readability; update those loop variable names in import.go to self-describing
identifiers (for example: trainingDir, relatedPath, jsonFile, baseFile or
similar) wherever they appear around the trainingDirs block and the other
referenced loops so the purpose of each variable is explicit (search for uses of
trainingDirs and the loop blocks later in the file to find all occurrences to
rename).

In `@store_test.go`:
- Around line 417-497: The tests are ignoring setup errors (using storeInstance,
_ := New(...), and discarding errors from Set/SetWithTTL/DeleteGroup); update
each test (e.g., TestStore_Exists_Good_Present, TestStore_Exists_Good_Absent,
TestStore_Exists_Good_ExpiredKeyReturnsFalse, TestStore_GroupExists_* and
similar) to capture the returned error from New and from
Set/SetWithTTL/DeleteGroup, call assertNoError(t, err) immediately after each
setup call, and only proceed if setup succeeded; reference the New, Set,
SetWithTTL, DeleteGroup and Close calls to locate where to add the error
variables and assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd735237-3848-4074-89bf-bdec9df055a1

📥 Commits

Reviewing files that changed from the base of the PR and between ebe5377 and fc77445.

📒 Files selected for processing (12)
  • README.md
  • events.go
  • import.go
  • import_test.go
  • medium.go
  • medium_test.go
  • publish.go
  • publish_test.go
  • scope.go
  • store.go
  • store_test.go
  • transaction.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • medium_test.go

Comment thread import.go Outdated
Comment thread publish_test.go
Comment thread scope.go
Comment thread scope.go Outdated
Comment thread store_test.go Outdated
…ent timeout on PR #4

Round 4 follow-up to fc77445.

Code:
- import.go: benchmark directory/subdirectory creation now checked
  with contextual errors (was silently failing on EnsureDir)
- import.go: terse loop variables expanded in reviewed loops
- scope.go: nil Namespace() guarded
- scope.go: scoped readiness uses 'store.ScopedStore.*' operation
  names across wrappers (matching test updated)

Tests:
- publish_test.go: HOME fallback test clears DIR_HOME (was flaky
  due to env leakage)
- store_test.go: purge event wait now uses bounded select with
  timeout (was hanging on missing event)
- store_test.go: Exists/GroupExists tests no longer swallow fixture
  setup errors

Verification: gofmt clean, golangci-lint v2 0 issues, GOWORK=off
go vet + go test -count=1 ./... pass with explicit cache paths.

Closes residual r4 findings on #4

Co-authored-by: Codex <noreply@openai.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
store_test.go (1)

1512-1516: ⚠️ Potential issue | 🟡 Minor

Bound the first event read so this test cannot wedge.

The blocking receive on Line 1516 can still hang the suite if the initial write event is missed. Mirror the bounded select used later in the same test before moving on to the expiry assertions.

⏱️ Suggested change
 	events := storeInstance.Watch("g")
 	defer storeInstance.Unwatch("g", events)

 	assertNoError(t, storeInstance.SetWithTTL("g", "ephemeral", "gone-soon", 1*time.Millisecond))
-	<-events
+	select {
+	case <-events:
+	case <-time.After(time.Second):
+		t.Fatal("timed out waiting for initial TTL set event")
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store_test.go` around lines 1512 - 1516, The test currently blocks on an
unbounded receive from the events channel returned by storeInstance.Watch("g"),
which can hang if the initial write event is missed; replace the plain
`<-events` with a bounded select that waits for the events channel or a timeout
(use time.After with a short duration) and fail the test on timeout so the suite
cannot wedge; update the block around storeInstance.Watch /
storeInstance.Unwatch and the SetWithTTL("g", ...) write to use this bounded
select pattern.
🧹 Nitpick comments (3)
import.go (3)

638-648: Consider using strings.Repeat from the standard library.

The repeat function reimplements functionality available in the standard library.

♻️ Suggested simplification
+import "strings"

-// repeat returns a string consisting of count copies of s.
-func repeat(s string, count int) string {
-	if count <= 0 {
-		return ""
-	}
-	b := core.NewBuilder()
-	for range count {
-		b.WriteString(s)
-	}
-	return b.String()
-}
+// repeat returns a string consisting of count copies of s.
+func repeat(s string, count int) string {
+	return strings.Repeat(s, count)
+}

Note: strings.Repeat already handles count <= 0 by returning an empty string.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@import.go` around lines 638 - 648, The custom repeat function currently
reimplements functionality from the standard library; replace the body of the
repeat function (named repeat) by delegating to strings.Repeat(s, count) and
remove the manual builder loop and core.NewBuilder usage; ensure you import the
"strings" package and keep behavior for count <= 0 (strings.Repeat already
handles that).

34-36: Inconsistent error wrapping compared to exec method.

The exec method wraps errors with core.E, but queryRowScan returns the raw Scan error. Consider wrapping for consistency, though the caller does wrap at the call site.

♻️ Suggested consistency fix
 func (session duckDBImportTransaction) queryRowScan(query string, dest any, args ...any) error {
-	return session.transaction.QueryRow(query, args...).Scan(dest)
+	if err := session.transaction.QueryRow(query, args...).Scan(dest); err != nil {
+		return core.E("store.duckDBImportTransaction.QueryRowScan", "scan row", err)
+	}
+	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@import.go` around lines 34 - 36, The queryRowScan method returns the raw Scan
error which is inconsistent with exec that wraps errors with core.E; update
duckDBImportTransaction.queryRowScan to wrap returned errors using core.E
(matching the pattern used by exec), providing a descriptive context message
(e.g., "queryRowScan failed" or the query text) and include the original error
so callers receive a wrapped error consistently.

101-101: Optional: Consider extracting import phases into helper functions.

Static analysis flags high cognitive complexity (98 vs 15 allowed). The sequential phases (golden set, training, benchmarks, benchmark questions, seeds) could each become a helper function accepting the importSession and cfg, improving readability and testability. This is a recommended refactor for maintainability but not blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@import.go` at line 101, The ImportAll function is too complex; extract each
sequential import phase into small helper functions to reduce cognitive
complexity: create functions like importGoldenSet(session *importSession, cfg
ImportConfig), importTrainingData(session *importSession, cfg ImportConfig),
importBenchmarks(session *importSession, cfg ImportConfig),
importBenchmarkQuestions(session *importSession, cfg ImportConfig), and
importSeeds(session *importSession, cfg ImportConfig) and move the corresponding
logic from ImportAll into these helpers; then have ImportAll orchestrate by
creating the importSession and calling those helpers in order, passing the same
importSession and cfg, and ensure error propagation and any resource cleanup
(currently in ImportAll) remain unchanged.
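The orchestration that prompt describes could be sketched as follows; the importSession and ImportConfig fields here are illustrative stand-ins for the real types in import.go, and only two of the five phases are shown:

```go
package main

import "fmt"

// Illustrative stand-ins for the real types in import.go.
type importSession struct{ done []string }
type ImportConfig struct{}

// Each phase gets the same signature, so ImportAll reads as a flat pipeline.
func importGoldenSet(s *importSession, cfg ImportConfig) error {
	s.done = append(s.done, "golden set")
	return nil
}

func importTrainingData(s *importSession, cfg ImportConfig) error {
	s.done = append(s.done, "training data")
	return nil
}

// ImportAll orchestrates the phases in order and stops on the first error,
// keeping error propagation identical to the inlined version.
func ImportAll(s *importSession, cfg ImportConfig) error {
	for _, phase := range []func(*importSession, ImportConfig) error{
		importGoldenSet,
		importTrainingData,
	} {
		if err := phase(s, cfg); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := &importSession{}
	if err := ImportAll(s, ImportConfig{}); err != nil {
		panic(err)
	}
	fmt.Println(s.done)
}
```

Each helper stays small enough to test in isolation, and the phase slice makes the import order explicit in one place.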
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@coverage_test.go`:
- Around line 93-100: Replace swallowed Close() calls with explicit error
checks: instead of discarding the error from storeInstance.Close() and
rawDatabase.Close(), capture their return values and call assertNoError(t, err)
(or the project's test helper) so the test fails immediately if Close() returns an
error; update the two occurrences referenced (the storeInstance.Close() call and
the rawDatabase.Close() call in the corruption/checkpoint block, and mirror the
same change in the other test at lines 180-186) to prevent running
checkpoint/corruption steps against a live connection.
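The check-instead-of-discard pattern can be sketched in isolation; faultyCloser and mustClose are hypothetical stand-ins for the real handles and for assertNoError(t, err):

```go
package main

import (
	"errors"
	"fmt"
)

// faultyCloser mimics a handle (store instance, raw DB connection) whose
// Close can fail.
type faultyCloser struct{ fail bool }

func (c faultyCloser) Close() error {
	if c.fail {
		return errors.New("close failed")
	}
	return nil
}

// mustClose stands in for assertNoError(t, err): it fails loudly instead of
// discarding the Close error, so later checkpoint/corruption steps never run
// against a connection that did not actually close.
func mustClose(c faultyCloser) {
	if err := c.Close(); err != nil {
		panic(err)
	}
}

func main() {
	mustClose(faultyCloser{})
	fmt.Println("connection closed; safe to touch the file on disk")
}
```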

In `@scope.go`:
- Around line 231-234: The quota check in ScopedStore.Set currently runs
separately from the subsequent write (scopedStore.checkQuota(...) then
scopedStore.store.Set(...)), which allows concurrent writers to bypass
MaxKeys/MaxGroups; change Set (and similarly SetIn and SetWithTTL) to perform
the quota check and the write atomically by using a single transactional/atomic
operation or a per-group lock around the check+write: e.g., acquire the
ScopedStore's group-level lock (or invoke the underlying store's
transaction/compare-and-set API) before calling checkQuota and then perform
scopedStore.store.Set(scopedStore.namespacedGroup(...), key, value) while
holding the lock/transaction, releasing it afterward so concurrent callers
cannot interleave and exceed quotas.
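A minimal sketch of the check-and-write-under-one-lock idea, assuming a simplified store with only a MaxKeys quota; the real ScopedStore would route through its Transaction API rather than a bare mutex:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Illustrative types, not the real ScopedStore API.
type scopedStore struct {
	mu      sync.Mutex
	maxKeys int
	data    map[string]string
}

var errQuota = errors.New("quota exceeded")

// Set holds the lock across the quota check AND the write, so two concurrent
// callers cannot both pass the check and then overshoot MaxKeys together.
func (s *scopedStore) Set(key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.data[key]; !exists && len(s.data) >= s.maxKeys {
		return errQuota
	}
	s.data[key] = value
	return nil
}

func main() {
	s := &scopedStore{maxKeys: 1, data: map[string]string{}}
	fmt.Println(s.Set("a", "1")) // first key fits the quota
	fmt.Println(s.Set("b", "2")) // second key is rejected
}
```

The same shape applies to SetIn and SetWithTTL: whatever the quota predicate is, it must observe and mutate the map under the same critical section.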

---

Duplicate comments:
In `@store_test.go`:
- Around line 1512-1516: The test currently blocks on an unbounded receive from
the events channel returned by storeInstance.Watch("g"), which can hang if the
initial write event is missed; replace the plain `<-events` with a bounded
select that waits for the events channel or a timeout (use time.After with a
short duration) and fail the test on timeout so the suite cannot wedge; update
the block around storeInstance.Watch / storeInstance.Unwatch and the
SetWithTTL("g", ...) write to use this bounded select pattern.

---

Nitpick comments:
In `@import.go`:
- Around line 638-648: The custom repeat function currently reimplements
functionality from the standard library; replace the body of the repeat function
(named repeat) by delegating to strings.Repeat(s, count) and remove the manual
builder loop and core.NewBuilder usage; ensure you import the "strings" package
and keep a guard for count <= 0 (strings.Repeat returns "" for count == 0 but panics on a negative count).
- Around line 34-36: The queryRowScan method returns the raw Scan error which is
inconsistent with exec that wraps errors with core.E; update
duckDBImportTransaction.queryRowScan to wrap returned errors using core.E
(matching the pattern used by exec), providing a descriptive context message
(e.g., "queryRowScan failed" or the query text) and include the original error
so callers receive a wrapped error consistently.
- Line 101: The ImportAll function is too complex; extract each sequential
import phase into small helper functions to reduce cognitive complexity: create
functions like importGoldenSet(session *importSession, cfg ImportConfig),
importTrainingData(session *importSession, cfg ImportConfig),
importBenchmarks(session *importSession, cfg ImportConfig),
importBenchmarkQuestions(session *importSession, cfg ImportConfig), and
importSeeds(session *importSession, cfg ImportConfig) and move the corresponding
logic from ImportAll into these helpers; then have ImportAll orchestrate by
creating the importSession and calling those helpers in order, passing the same
importSession and cfg, and ensure error propagation and any resource cleanup
(currently in ImportAll) remain unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8352d0ea-f7aa-4663-903a-83d9e2ff7119

📥 Commits

Reviewing files that changed from the base of the PR and between fc77445 and 4aeddac.

📒 Files selected for processing (5)
  • coverage_test.go
  • import.go
  • publish_test.go
  • scope.go
  • store_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • publish_test.go

…on PR #4

Round 5 follow-up to 4aeddac.

Code:
- scope.go: scoped quota race fixed — Set/SetIn/SetWithTTL route
  through ScopedStore.Transaction (atomic). Non-transactional quota
  helper removed.
- coverage_test.go: corruption setup tests no longer swallow Close()
  errors
- store_test.go: duplicate unbounded event receive replaced with
  bounded select + timeout (was a hang risk)
- import.go: queryRowScan errors wrapped with method context
- import.go: documented strings-import ban via inline comment

Verification: gofmt clean, golangci-lint v2 0 issues, go vet + go
test -count=1 ./... pass.

Closes residual r5 findings on #4

Co-authored-by: Codex <noreply@openai.com>
@sonarqubecloud

Quality Gate failed

Failed conditions
8 Security Hotspots

See analysis details on SonarQube Cloud

@Snider Snider merged commit fe3a4d2 into main Apr 27, 2026
9 of 12 checks passed


3 participants