
fully backport sheltie features to lassie #524

Draft
parkan wants to merge 2 commits into filecoin-project:main from parkan:backport-sheltie-v2

Conversation

parkan (Collaborator) commented Feb 21, 2026

the "rip the bandaid off" work happened in #516; the intent was to follow up with additional patches carrying the new functionality, but a combination of health issues and context switching across too many other projects resulted in shipping 0.25.0 without the real value-adds

WARNING: the below is a mega-patch with best effort at clarity, sheltie was initially meant to be its own thing but multiple people in the ecosystem encouraged me to fully upstream the work

the individual commits have been repackaged in https://github.com/parkan/sheltie/commits/audit-history/ for legibility (note: that history may have some sed noise from the sheltie/lassie renaming; it should all be clean in this PR); applying them to this tree piecemeal is an ordeal due to the significant divergence

full changes summary will be in the comment below this one, tested on my own fork first

parkan (Collaborator, Author) commented Feb 21, 2026

backport of all feature work from parkan/sheltie to lassie. sheltie was an experimental HTTP-only fork; this brings its features upstream

primary changes

  • delegated routing v1: replace legacy IPNI API with HTTP delegated routing (/routing/v1/providers/{cid}), with protocol filtering and rate limiting support
  • remove graphsync and bitswap: lassie becomes HTTP-only (trustless gateway protocol) -- partially landed already in #516 (Disable bitswap and upgrade boxo to v0.35.0)
  • frontier-based DAG stitching: HybridRetriever attempts whole-DAG CAR fetch first, then recurses to available subgraphs from the provider set
  • parallel leaf block fetching: raw leaf CIDs batched and fetched concurrently
  • streaming output: StreamingStore writes blocks directly to output without temp files
  • --extract flag: streaming UnixFS extraction materializes files directly to disk during retrieval
  • --skip-block-verification flag: optional CID verification bypass; skipping yields only a marginal throughput improvement on modern CPUs, but matches the previous (unverified) behavior
  • logging output reformatted: log HTTP endpoints instead of peer IDs (which are meaningless for trustless gateways)

removed

  • pkg/heyfil/ — Filecoin address resolution, dead
  • pkg/net/client/, pkg/net/host/ — libp2p host initialization
  • pkg/retriever/graphsyncretriever.go — graphsync protocol
  • pkg/retriever/combinators/, pkg/retriever/coordinators/ — multi-protocol orchestration
  • pkg/retriever/proposal.go, pkg/retriever/protocolsplitter.go — dead
  • pkg/internal/lp2ptransports/ — not needed anymore
  • pkg/events/graphsync{accepted,proposed}.go — graphsync events

added

  • pkg/blockbroker/ — HTTP block session management
  • pkg/extractor/ — streaming UnixFS extraction
  • pkg/retriever/hybridretriever.go — frontier-based DAG stitching
  • pkg/retriever/frontier.go — frontier tracking for partial DAG retrieval
  • pkg/storage/streamingstore.go — streaming-first storage
  • pkg/indexerlookup/delegated.go — delegated routing v1 response types
  • pkg/events/extraction.go — extraction events

parkan force-pushed the backport-sheltie-v2 branch from 383f9ed to 71010c7 (February 21, 2026 18:51)
parkan requested a review from rvagg (February 21, 2026 18:56)
parkan force-pushed the backport-sheltie-v2 branch 3 times, most recently from b51fe53 to 889568c (February 21, 2026 19:21)
HTTP-only retrieval with frontier-based DAG stitching, delegated
routing v1, streaming UnixFS extraction, and parallel leaf block
fetching. Removes graphsync, bitswap, heyfil, and libp2p host
dependencies.

Developed in github.com/parkan/sheltie. Audit trail:
https://github.com/parkan/sheltie/tree/audit-history
parkan force-pushed the backport-sheltie-v2 branch from 889568c to 172cf42 (February 21, 2026 19:29)
parkan (Collaborator, Author) commented Feb 21, 2026

(the optional and UnixFS conformance checks are expected to fail here)

parkan marked this pull request as draft (February 21, 2026 21:18)
rvagg (Member) commented Feb 23, 2026

@parkan it looks like you've ditched the dups option and gone with a "store temporary CAR to extract from" option? The point of dups is that you then don't need a temporary CAR for a direct unixfs extraction - each next block you get should be the next block that a unixfs extraction needs. Where there's a mismatch, something's wrong. So you should be able to just hook up the incoming block stream as the blockstore for unixfs, such that every Get() is served by the next incoming block. As long as the unixfs extractor is fed the same path that the retrieval was given, then it's the same set of blocks in the same order. It looks like pkg/extractor/extractor.go is doing its own traversal - I've only skimmed so I may be wrong, but it looks like you might be doing it the hard way, implementing a custom traversal and unixfs parsing engine, when that should already exist for you I think.

dups=y was implemented for exactly this case - consider a low-resource environment like a cloudflare worker, able to retrieve a stream from some endpoint and extract a unixfs file/directory/whatever from it and it doesn't have the ability to accumulate in memory or store it anywhere and block until it's ready. You should be able to stream the data out as quickly as you get it in and not have to accumulate.

parkan (Collaborator, Author) commented Feb 23, 2026

> @parkan it looks like you've ditched the dups option and gone with a "store temporary CAR to extract from" option? [...] It looks like pkg/extractor/extractor.go is doing its own traversal [...]
>
> dups=y was implemented for exactly this case [...] You should be able to stream the data out as quickly as you get it in and not have to accumulate.

let me run through the workflow in more detail to make sure that exactly what I want to happen is happening, and post a more detailed reply; I realize the diff is huge so the logic changes are not entirely clear

the previous logic did have dups=y as a non-default option with a provision for implementing streaming

please give me a day so I can put together the flow diagram

simply piping output (with or without dups) to car extract stdin as suggested in the docs did not work for me because it wants a rewindable stream

parkan (Collaborator, Author) commented Feb 23, 2026

@rvagg briefly, where this all came from:

(1) I cannot find a combination of lassie + go-car versions where this basic example works (directly from https://github.com/filecoin-project/lassie/tree/v0.24.0?tab=readme-ov-file#fetch-example and earlier docs):

$ lassie fetch -o - -p bafybeic56z3yccnla3cutmvqsn5zy3g24muupcsjtoyp3pu5pm5amurjx4 | car extract
EOF

go-unixfsnode wants blockstore.Get(cid) but stdin isn't seekable; go-car tries to consume the input for random reads (even though the blocks actually arrive in order) and EOFs

you can work around this by piping through cat or sponge etc which buffers the CAR stream but results in RSS more than 2x the input size in my tests

(2) --dups had no effect on (1) and the docstring reads:

> allow duplicate blocks to be written to the output CAR, which may be useful for streaming. (default: false)

i.e. we had some potential savings from not writing duplicate blocks to both tmp car (see below) as well as finalized car but no other benefit since there is no streaming consumer, and at a cost of re-scanning the tmp car and keeping complex state in memory

(3) the previous implementation always wrote CARs to tempDir (which I discovered when I couldn't download a graph bigger than my tmpfs despite having plenty of space on --output fs), same thing with dups=y

(4) trustless gateway/HTTP retrieval could only fetch contiguous graphs and bailed out on any missing blocks; the bitswap implementation would stitch graphs, but mostly because it just followed links and fetched nodes as it saw them, with a round trip per block (also, trustless gateway fetches were not verified -- addressed in BlockBroker)

so what this does is:

  • flips dups (default: false) to stream (default: true) which also bypasses tmpDir writes, i.e. we actually keep dups and lean into it
  • re-implements DFS/seen blocks logic from dups with a compact in-memory representation
  • fits the verified subgraphs together based on the map; instead of racing protocols and picking the fastest replies, we request subgraphs from providers and fit them together as needed, which gives us out-of-order handling
  • --extract materializes unixfs files and directories via the above mechanism directly

it's possible that I'm missing some existing way of putting these pieces together but I haven't been able to find it; with this approach I am able to download multi-TiB unixfs graphs, with thousands of files, with <100MB RSS, directly materialized and verified on the fly with tiny CPU cost

caveat: sheltie was originally written to solve my own problems; some of the E2E workflows (RetrieveAndExtract in particular) are not fully exercised in tests; hand-written dag-pb graphs with no size field likely cannot be extracted on the fly since we don't know byte offsets; a very large transfer can fail due to a dropped multiplexed HTTP/2 connection (have a fix for that but not merged 🤦); and a few other things

I'll run some fresh benchmarks tomorrow when I'm at a machine with enough disk space

parkan (Collaborator, Author) commented Feb 24, 2026

while benchmarking, I noticed an issue where a very long-running HTTP/2 stream drop is not recovered gracefully; fix added

I am also going to complete the module rename for sheltie proper, in case this work is rejected, since keeping a common base wouldn't make sense at that point

parkan (Collaborator, Author) commented Feb 24, 2026

these may help understand what's happening:

[flow diagrams attached: lassie-retriever, lassie-extractor]
