
fully backport sheltie features to lassie #524

Draft
parkan wants to merge 2 commits into filecoin-project:main from parkan:backport-sheltie-v2

Conversation

parkan (Collaborator) commented Feb 21, 2026

the "rip the bandaid off" work happened in #516; the intent was to follow up with additional patches carrying the new functionality, but a combination of health issues and context switching across too many other projects resulted in shipping 0.25.0 without the real value-adds

WARNING: the below is a mega-patch with best effort at clarity, sheltie was initially meant to be its own thing but multiple people in the ecosystem encouraged me to fully upstream the work

the individual commits have been repackaged in https://github.com/parkan/sheltie/commits/audit-history/ for legibility (note: that history may have some sed noise from the sheltie/lassie renaming; it should all be clean in this PR); applying them to this tree piecemeal is an ordeal due to the significant divergence

full changes summary will be in the comment below this one, tested on my own fork first

parkan (Collaborator, Author) commented Feb 21, 2026

backport of all feature work from parkan/sheltie to lassie. sheltie was an experimental HTTP-only fork; this brings its features upstream

primary changes

  • delegated routing v1: replace legacy IPNI API with HTTP delegated routing (/routing/v1/providers/{cid}), with protocol filtering and rate limiting support
  • remove graphsync and bitswap: lassie becomes HTTP-only (trustless gateway protocol) -- partially landed already in #516 (Disable bitswap and upgrade boxo to v0.35.0)
  • frontier-based DAG stitching: HybridRetriever attempts whole-DAG CAR fetch first, then recurses to available subgraphs from the provider set
  • parallel leaf block fetching: raw leaf CIDs batched and fetched concurrently
  • streaming output: StreamingStore writes blocks directly to output without temp files
  • --extract flag: streaming UnixFS extraction materializes files directly to disk during retrieval
  • --skip-block-verification flag: optional CID verification bypass; skipping yields only a marginal throughput improvement on modern CPUs, but matches the previous (unverified) behavior
  • logging output reformatted: log HTTP endpoints instead of peer IDs (which are meaningless for trustless gateways)

removed

  • pkg/heyfil/ — Filecoin address resolution, dead
  • pkg/net/client/, pkg/net/host/ — libp2p host initialization
  • pkg/retriever/graphsyncretriever.go — graphsync protocol
  • pkg/retriever/combinators/, pkg/retriever/coordinators/ — multi-protocol orchestration
  • pkg/retriever/proposal.go, pkg/retriever/protocolsplitter.go — dead
  • pkg/internal/lp2ptransports/ — not needed anymore
  • pkg/events/graphsync{accepted,proposed}.go — graphsync events

added

  • pkg/blockbroker/ — HTTP block session management
  • pkg/extractor/ — streaming UnixFS extraction
  • pkg/retriever/hybridretriever.go — frontier-based DAG stitching
  • pkg/retriever/frontier.go — frontier tracking for partial DAG retrieval
  • pkg/storage/streamingstore.go — streaming-first storage
  • pkg/indexerlookup/delegated.go — delegated routing v1 response types
  • pkg/events/extraction.go — extraction events

parkan force-pushed the backport-sheltie-v2 branch from 383f9ed to 71010c7 (February 21, 2026 18:51)
parkan requested a review from rvagg (February 21, 2026 18:56)
parkan force-pushed the backport-sheltie-v2 branch 3 times, most recently from b51fe53 to 889568c (February 21, 2026 19:21)
HTTP-only retrieval with frontier-based DAG stitching, delegated
routing v1, streaming UnixFS extraction, and parallel leaf block
fetching. Removes graphsync, bitswap, heyfil, and libp2p host
dependencies.

Developed in github.com/parkan/sheltie. Audit trail:
https://github.com/parkan/sheltie/tree/audit-history
parkan force-pushed the backport-sheltie-v2 branch from 889568c to 172cf42 (February 21, 2026 19:29)
parkan (Collaborator, Author) commented Feb 21, 2026

(the optional and UnixFS conformance checks are expected to fail here)

parkan marked this pull request as draft (February 21, 2026 21:18)
rvagg (Member) commented Feb 23, 2026

@parkan it looks like you've ditched the dups option and gone with a "store temporary CAR to extract from" option? The point of dups is that you then don't need a temporary CAR for a direct unixfs extraction - each next block you get should be the next block that a unixfs extraction needs. Where there's a mismatch, something's wrong. So you should be able to just hook up the incoming block stream as the blockstore for unixfs, such that every Get() is served by the next incoming block. As long as the unixfs extractor is fed the same path that the retrieval was given, then it's the same set of blocks in the same order. It looks like pkg/extractor/extractor.go is doing its own traversal - I've only skimmed so I may be wrong, but it looks like you might be doing it the hard way, implementing a custom traversal and unixfs parsing engine, when that should already exist for you I think.

dups=y was implemented for exactly this case - consider a low-resource environment like a cloudflare worker, able to retrieve a stream from some endpoint and extract a unixfs file/directory/whatever from it and it doesn't have the ability to accumulate in memory or store it anywhere and block until it's ready. You should be able to stream the data out as quickly as you get it in and not have to accumulate.

parkan (Collaborator, Author) commented Feb 23, 2026

> @parkan it looks like you've ditched the dups option and gone with a "store temporary CAR to extract from" option? [...] It looks like pkg/extractor/extractor.go is doing its own traversal [...]
>
> dups=y was implemented for exactly this case [...] You should be able to stream the data out as quickly as you get it in and not have to accumulate.

let me run through the workflow in more detail to make sure that exactly what I want to happen is happening, and post a more detailed reply; I realize the diff is huge so the logic changes are not entirely clear

the previous logic did have dups=y as a non-default option with a provision for implementing streaming

please give me a day so I can put together the flow diagram

simply piping output (with or without dups) to car extract stdin as suggested in the docs did not work for me because it wants a rewindable stream

parkan (Collaborator, Author) commented Feb 23, 2026

@rvagg briefly, where this all came from:

(1) I cannot find a combination of lassie + go-car versions where this basic example works (directly from https://github.com/filecoin-project/lassie/tree/v0.24.0?tab=readme-ov-file#fetch-example and earlier docs):

$ lassie fetch -o - -p bafybeic56z3yccnla3cutmvqsn5zy3g24muupcsjtoyp3pu5pm5amurjx4 | car extract
EOF

go-unixfsnode wants blockstore.Get(cid) but stdin isn't seekable; go-car tries to consume the input for random reads (even though the blocks actually arrive in order) and EOFs

you can work around this by piping through cat or sponge etc which buffers the CAR stream but results in RSS more than 2x the input size in my tests

(2) --dups had no effect on (1) and the docstring reads:

> allow duplicate blocks to be written to the output CAR, which may be useful for streaming. (default: false)

i.e. we had some potential savings from not writing duplicate blocks to both tmp car (see below) as well as finalized car but no other benefit since there is no streaming consumer, and at a cost of re-scanning the tmp car and keeping complex state in memory

(3) the previous implementation always wrote CARs to tempDir (which I discovered when I couldn't download a graph bigger than my tmpfs despite having plenty of space on --output fs), same thing with dups=y

(4) trustless gateway/HTTP retrieval could only fetch contiguous graphs and bailed out on any missing blocks; the bitswap implementation would stitch graphs, but mostly because it just followed links and fetched nodes as it saw them, with a round trip per block (also, trustless gateway fetches were not verified -- addressed in BlockBroker)

so what this does is:

  • flips dups (default: false) to stream (default: true) which also bypasses tmpDir writes, i.e. we actually keep dups and lean into it
  • re-implements DFS/seen blocks logic from dups with a compact in-memory representation
  • fits the verified subgraphs together based on the map; instead of racing protocols and picking the fastest replies, we request subgraphs from providers and fit them together as needed, which gives us out-of-order handling
  • --extract materializes unixfs files and directories via the above mechanism directly

it's possible that I'm missing some existing way of putting these pieces together but I haven't been able to find it; with this approach I am able to download multi-TiB unixfs graphs, with thousands of files, with <100MB RSS, directly materialized and verified on the fly with tiny CPU cost

caveat: sheltie was originally written to solve my own problems; some of the E2E workflows (RetrieveAndExtract in particular) are not fully exercised in tests; hand-written dag-pb graphs with no size field likely cannot be extracted on the fly since we don't know byte offsets; a very large transfer can fail due to a dropped multiplexed HTTP/2 connection (have a fix for that but not merged 🤦); and a few other things

I'll run some fresh benchmarks tomorrow when I'm at a machine with enough disk space

parkan (Collaborator, Author) commented Feb 24, 2026

while benchmarking, I noticed an issue where a very long-running HTTP/2 stream drop is not recovered gracefully; fix added

I am also going to complete the module rename for sheltie proper, in case this work is rejected, since keeping a common base wouldn't make sense at that point

parkan (Collaborator, Author) commented Feb 24, 2026

these may help understand what's happening:

[flow diagrams attached: lassie-retriever, lassie-extractor]
