20 commits
2b591c4
feat(dag/walker): add VisitedTracker with BloomTracker and MapTracker
lidel Mar 20, 2026
224c2ae
feat(dag/walker): add WalkDAG with codec-agnostic link extraction
lidel Mar 21, 2026
8c8f5f6
feat(dag/walker): add WalkEntityRoots for entity-aware traversal
lidel Mar 21, 2026
99495ec
test(dag/walker): add BloomTracker FP rate regression tests
lidel Mar 21, 2026
53d7674
fix(provider): stream error continues to next, add NewConcatProvider
lidel Mar 21, 2026
e38f023
feat(pinner): add NewUniquePinnedProvider and NewPinnedEntityRootsPro…
lidel Mar 21, 2026
7b8f853
test: add PrioritizedProvider error-continue regression test
lidel Mar 21, 2026
685c82e
refactor(provider): use labeled break in NewConcatProvider for consis…
lidel Mar 21, 2026
b75fa03
refactor(dag/walker): extract shared linkSystemForBlockstore helper
lidel Mar 21, 2026
5147959
fix(dag/walker): skip emitting identity CIDs, add tests
lidel Mar 21, 2026
609ff3d
test(dag/walker): add symlink entity detection tests
lidel Mar 21, 2026
8cfa9a0
refactor: consolidate identity CID tests, filter direct pins
lidel Mar 21, 2026
11cd29e
fix(dag/walker): visit siblings in left-to-right link order
lidel Mar 21, 2026
44bb0a3
fix(pinner): continue on pin iteration error in unique providers
lidel Mar 21, 2026
56a0a31
docs(dag/walker): document implicit behaviors
lidel Mar 21, 2026
dcfda13
Merge branch 'main' into feat/provide-entity-roots-with-dedup
lidel Mar 23, 2026
14c5f91
Merge remote-tracking branch 'origin/main' into feat/provide-entity-r…
lidel Mar 25, 2026
0fc1a0b
fix: address review feedback from gammazero
lidel Mar 29, 2026
577fa3f
feat(walker): log bloom tracker creation and autoscaling
lidel Mar 29, 2026
c8920a2
feat(walker): add Deduplicated() to BloomTracker and MapTracker
lidel Mar 29, 2026
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -16,11 +16,13 @@ The following emojis are used to highlight certain changes:

### Added

- `dag/walker`: new package for memory-efficient DAG traversal with deduplication. The `VisitedTracker` interface has two implementations: `BloomTracker` (scalable bloom filter chain, ~4 bytes/CID vs ~75 bytes for a map) and `MapTracker` (exact, for tests). `WalkDAG` provides iterative DFS traversal with integrated dedup, supports dag-pb, dag-cbor, raw, and other registered codecs, and is ~2x faster than the legacy go-ipld-prime selector-based traversal. `WalkEntityRoots` emits only entity roots (files, directories, HAMT shards) instead of every block, skipping internal file chunks. [#1124](https://github.com/ipfs/boxo/pull/1124)
- `routing/http/client`: `WithProviderInfoFunc` option resolves provider addresses at provide-time instead of client construction time. This only impacts legacy HTTP-only custom routing setups that depend on [IPIP-526](https://github.com/ipfs/specs/pull/526) and were sending unresolved `0.0.0.0` addresses in provider records instead of actual interface addresses. [#1115](https://github.com/ipfs/boxo/pull/1115)
- `chunker`: added `Register` function to allow custom chunkers to be registered for use with `FromString`.

### Changed

- `provider`: `NewPrioritizedProvider` now continues to the next stream when one fails instead of stopping all streams. `NewConcatProvider` added for pre-deduplicated streams. [#1124](https://github.com/ipfs/boxo/pull/1124)
- `chunker`: `FromString` now rejects malformed `size-` strings with extra parameters (e.g. `size-123-extra` was previously silently accepted).
- upgrade to `go-libp2p` [v0.48.0](https://github.com/libp2p/go-libp2p/releases/tag/v0.48.0)

@@ -32,6 +34,7 @@ The following emojis are used to highlight certain changes:

### Fixed

- `pinner`: `NewUniquePinnedProvider` and `NewPinnedEntityRootsProvider` now log and skip corrupted pin entries instead of aborting the entire provide cycle, allowing remaining pins to still be provided. [#1124](https://github.com/ipfs/boxo/pull/1124)
- `bitswap/server`: incoming identity CIDs in wantlist messages are now silently ignored instead of killing the connection to the remote peer. Some IPFS implementations naively send identity CIDs, and disconnecting them for it caused unnecessary churn. [#1117](https://github.com/ipfs/boxo/pull/1117)
- `bitswap/network`: `ExtractHTTPAddress` now infers default ports for portless HTTP multiaddrs (e.g. `/dns/host/https` without `/tcp/443`). [#1123](https://github.com/ipfs/boxo/pull/1123)

21 changes: 21 additions & 0 deletions dag/walker/doc.go
@@ -0,0 +1,21 @@
// Package walker provides memory-efficient DAG traversal with
// deduplication. Optimized for the IPFS provide system, but useful
// anywhere repeated DAG walks need to skip already-visited subtrees.
//
// The primary entry point is [WalkDAG], which walks a DAG rooted at a
// given CID, emitting each visited CID to a callback. When combined
// with a [VisitedTracker] (e.g. [BloomTracker]), entire subtrees
// already seen are skipped in O(1).
//
// For entity-aware traversal that only emits file/directory/HAMT roots
// instead of every block, see [WalkEntityRoots].
//
// Blocks are decoded using the codecs registered in the process via
// the global multicodec registry. In a standard kubo build this
// includes dag-pb, dag-cbor, dag-json, cbor, json, and raw.
//
// Use [LinksFetcherFromBlockstore] to create a fetcher backed by a
// local blockstore. For custom link extraction (e.g. a different codec
// registry or non-blockstore storage), pass your own [LinksFetcher]
// function directly to [WalkDAG].
package walker
192 changes: 192 additions & 0 deletions dag/walker/entity.go
@@ -0,0 +1,192 @@
package walker

import (
	"context"
	"slices"

	blockstore "github.com/ipfs/boxo/blockstore"
	"github.com/ipfs/boxo/ipld/unixfs"
	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipld/go-ipld-prime"
	cidlink "github.com/ipld/go-ipld-prime/linking/cid"
	basicnode "github.com/ipld/go-ipld-prime/node/basic"
	mh "github.com/multiformats/go-multihash"
)

// EntityType represents the semantic type of a DAG entity.
type EntityType int

const (
	EntityUnknown   EntityType = iota
	EntityFile                 // UnixFS file root (not its chunks)
	EntityDirectory            // UnixFS flat directory
	EntityHAMTShard            // UnixFS HAMT sharded directory bucket
	EntitySymlink              // UnixFS symbolic link
)

// NodeFetcher returns child link CIDs and entity type for a given CID.
// Used by [WalkEntityRoots] which needs UnixFS type detection to decide
// whether to descend into children (directories, HAMT shards) or stop
// (files, symlinks).
type NodeFetcher func(ctx context.Context, c cid.Cid) (linkCIDs []cid.Cid, entityType EntityType, err error)

// NodeFetcherFromBlockstore creates a [NodeFetcher] backed by a local
// blockstore. Like [LinksFetcherFromBlockstore], it decodes blocks via
// ipld-prime's global multicodec registry (dag-pb, dag-cbor, raw, etc.)
// and handles identity CIDs transparently via [blockstore.NewIdStore].
//
// Entity type detection:
//   - dag-pb with valid UnixFS Data: file, directory, HAMT shard, or symlink
//   - dag-pb without valid UnixFS Data: EntityUnknown
//   - raw codec: EntityFile (small file stored as a single raw block)
//   - all other codecs (dag-cbor, dag-json, etc.): EntityUnknown
func NodeFetcherFromBlockstore(bs blockstore.Blockstore) NodeFetcher {
	ls := linkSystemForBlockstore(bs)

	return func(ctx context.Context, c cid.Cid) ([]cid.Cid, EntityType, error) {
		lnk := cidlink.Link{Cid: c}
		nd, err := ls.Load(ipld.LinkContext{Ctx: ctx}, lnk, basicnode.Prototype.Any)
		if err != nil {
			return nil, EntityUnknown, err
		}

		links := collectLinks(c, nd)
		entityType := detectEntityType(c, nd)
		return links, entityType, nil
	}
}

// detectEntityType infers the UnixFS entity type from an ipld-prime
// decoded node. For dag-pb nodes, it reads the "Data" field and parses
// it as UnixFS protobuf. For raw codec nodes, it returns EntityFile.
// For everything else, it returns EntityUnknown.
func detectEntityType(c cid.Cid, nd ipld.Node) EntityType {
	codec := c.Prefix().Codec

	// raw codec: small file stored as a single block
	if codec == cid.Raw {
		return EntityFile
	}

	// only dag-pb has UnixFS semantics; other codecs are unknown
	if codec != cid.DagProtobuf {
		return EntityUnknown
	}

	// dag-pb: try to read the "Data" field for UnixFS type
	dataField, err := nd.LookupByString("Data")
	if err != nil || dataField.IsAbsent() || dataField.IsNull() {
		return EntityUnknown
	}

	dataBytes, err := dataField.AsBytes()
	if err != nil {
		return EntityUnknown
	}

	fsn, err := unixfs.FSNodeFromBytes(dataBytes)
	if err != nil {
		return EntityUnknown
	}

	switch fsn.Type() {
	case unixfs.TFile, unixfs.TRaw:
		return EntityFile
	case unixfs.TDirectory:
		return EntityDirectory
	case unixfs.THAMTShard:
		return EntityHAMTShard
	case unixfs.TSymlink:
		return EntitySymlink
	default:
		return EntityUnknown
	}
}

// WalkEntityRoots traverses a DAG calling emit for each entity root.
//
// Entity roots are semantic boundaries in the DAG:
//   - File/symlink roots: emitted, children (chunks) NOT traversed
//   - Directory roots: emitted, children recursed
//   - HAMT shard nodes: emitted (needed for directory enumeration),
//     children recursed
//   - Non-UnixFS nodes (dag-cbor, dag-json, etc.): emitted AND children
//     recursed to discover further content. The +entities optimization
//     (skip chunks) only applies to UnixFS files; for all other codecs,
//     every reachable CID is emitted.
//   - Raw leaf nodes: emitted (no children to recurse)
//
// Same traversal order as [WalkDAG]: pre-order DFS with left-to-right
// sibling visiting. Uses the same option types: [WithVisitedTracker]
// for bloom/map dedup across walks, [WithLocality] for MFS locality
// checks.
func WalkEntityRoots(
	ctx context.Context,
	root cid.Cid,
	fetch NodeFetcher,
	emit func(cid.Cid) bool,
	opts ...Option,
) error {
	cfg := &walkConfig{}
	for _, o := range opts {
		o(cfg)
	}

	stack := []cid.Cid{root}

	for len(stack) > 0 {
		if ctx.Err() != nil {
			return ctx.Err()
		}

		// pop
		c := stack[len(stack)-1]
		stack = stack[:len(stack)-1]

		// dedup via tracker
		if cfg.tracker != nil && !cfg.tracker.Visit(c) {
			continue
		}

		// locality check
		if cfg.locality != nil {
			local, err := cfg.locality(ctx, c)
			if err != nil {
				log.Errorf("entity walk: locality check %s: %s", c, err)
				continue
			}
			if !local {
				continue
			}
		}

		// fetch block and detect entity type
		children, entityType, err := fetch(ctx, c)
		if err != nil {
			log.Errorf("entity walk: fetch %s: %s", c, err)
			continue
		}

		// decide whether to descend into children
		descend := entityType != EntityFile && entityType != EntitySymlink
		if descend {
			// reverse so first link is popped next (left-to-right
			// sibling order, matching WalkDAG and legacy BlockAll)
			slices.Reverse(children)
			stack = append(stack, children...)
		}

		// skip identity CIDs: content is inline, no need to provide.
		// we still descend (above) so an inlined dag-pb directory's
		// normal children get provided.
		if c.Prefix().MhType == mh.IDENTITY {
			continue
		}

		if !emit(c) {
			return nil
		}
	}

	return nil
}