fully backport sheltie features to lassie #524
parkan wants to merge 2 commits into filecoin-project:main
|
backport of all feature work from parkan/sheltie to lassie. sheltie was an experimental HTTP-only fork; this brings its features upstream.

primary changes:
removed
added
|
383f9ed to 71010c7
b51fe53 to 889568c
HTTP-only retrieval with frontier-based DAG stitching, delegated routing v1, streaming UnixFS extraction, and parallel leaf block fetching. Removes graphsync, bitswap, heyfil, and libp2p host dependencies. Developed in github.com/parkan/sheltie. Audit trail: https://github.com/parkan/sheltie/tree/audit-history
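To make "frontier-based DAG stitching" concrete, here is a toy sketch of one plausible reading of the idea: each gateway returns whatever contiguous subgraph it holds under the requested root, and any link pointing at a block that did not arrive goes onto a frontier that is retried against the available gateways. The `Block`, `Gateway`, and `Stitch` names are hypothetical and exist only for illustration; they are not the lassie or sheltie API, and real retrieval would use CIDs, CAR streams, and hash verification.

```go
package main

import "fmt"

// Block is a toy DAG node: an ID plus the IDs it links to (hypothetical type).
type Block struct {
	ID    string
	Links []string
}

// Gateway is a hypothetical in-memory stand-in for an HTTP gateway.
type Gateway map[string]Block

// Fetch walks depth-first from root and returns the blocks this gateway holds.
// Subtrees rooted at a missing block are silently skipped.
func (g Gateway) Fetch(root string) []Block {
	var out []Block
	seen := map[string]bool{}
	var walk func(id string)
	walk = func(id string) {
		b, ok := g[id]
		if !ok || seen[id] {
			return
		}
		seen[id] = true
		out = append(out, b)
		for _, l := range b.Links {
			walk(l)
		}
	}
	walk(root)
	return out
}

// Stitch assembles a full DAG from partial responses. It returns the blocks
// collected, the number of gateway requests made, and an error if some block
// is held by no gateway.
func Stitch(root string, gateways []Gateway) (map[string]Block, int, error) {
	have := map[string]Block{}
	frontier := []string{root}
	requests := 0
	for len(frontier) > 0 {
		id := frontier[0]
		frontier = frontier[1:]
		if _, ok := have[id]; ok {
			continue // already delivered as part of an earlier response
		}
		var got []Block
		for _, gw := range gateways {
			requests++
			if got = gw.Fetch(id); len(got) > 0 {
				break
			}
		}
		if len(got) == 0 {
			return have, requests, fmt.Errorf("no gateway has block %s", id)
		}
		for _, b := range got {
			have[b.ID] = b
		}
		// Links that point outside what we hold become the new frontier.
		for _, b := range got {
			for _, l := range b.Links {
				if _, ok := have[l]; !ok {
					frontier = append(frontier, l)
				}
			}
		}
	}
	return have, requests, nil
}

func main() {
	a := Gateway{
		"root": {ID: "root", Links: []string{"a", "b"}},
		"a":    {ID: "a", Links: []string{"a1"}},
		"a1":   {ID: "a1"},
	}
	b := Gateway{
		"b":  {ID: "b", Links: []string{"b1", "b2"}},
		"b1": {ID: "b1"},
		"b2": {ID: "b2"},
	}
	have, requests, err := Stitch("root", []Gateway{a, b})
	fmt.Println(len(have), requests, err)
}
```

The point of the frontier is that the number of requests scales with the number of gaps in each provider's copy rather than with the number of blocks, in contrast to a per-block roundtrip.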
889568c to 172cf42
|
(optional and UnixFS conformance is expected to fail here) |
|
@parkan it looks like you've ditched the
|
let me run through the workflow to confirm that it does exactly what I intend, then post a more detailed reply. I realize the diff is huge, so the logic changes are not entirely clear.

the previous logic did have dups=y as a non-default option, with a provision for implementing streaming. please give me a day so I can put together the flow diagram.

simply piping the output (with or without dups) to car extract stdin, as suggested in the docs, did not work for me because it wants a rewindable stream
|
@rvagg briefly, where this all came from:

(1) I cannot find a combination of lassie + go-car versions where this basic example works (directly from https://github.com/filecoin-project/lassie/tree/v0.24.0?tab=readme-ov-file#fetch-example and earlier docs):

you can work around this by piping through

(2) i.e. we had some potential savings from not writing duplicate blocks to both tmp car (see below) as well as finalized car but no other benefit since there is no streaming consumer, and at a cost of re-scanning the tmp car and keeping complex state in memory

(3) the previous implementation always wrote CARs to tempDir (which I discovered when I couldn't download a graph bigger than my tmpfs despite having plenty of space on --output fs), same thing with dups=y

(4) trustless gateway/http retrieval could only fetch contiguous graphs and bailed out on any missing blocks; the bitswap implementation would stitch graphs but mostly because it just followed links and fetched nodes as it saw them, roundtrip per block (also trustless gateway fetches were not verified, addressed in

so what this does is:
it's possible that I'm missing some existing way of putting these pieces together, but I haven't been able to find it. with this approach I am able to download multi-TiB unixfs graphs, with thousands of files, with <100MB RSS, directly materialized and verified on the fly with tiny CPU cost.

caveat: sheltie was originally written to solve my own problems:
- some of the E2E workflows (RetrieveAndExtract in particular) are not fully exercised in tests
- fully hand-written dag-pb graphs with no size field likely cannot be extracted on the fly, since we don't know byte offsets
- a very large transfer can fail due to a dropped multiplexed HTTP/2 connection (have a fix for that but not merged 🤦)
- a few other things

I'll run some fresh benchmarks tmw when I'm at a machine with enough disk space
|
while benchmarking, I noticed an issue where a drop of a very long-running HTTP/2 stream is not recovered gracefully; fix added.

I am also going to complete the module rename for sheltie per se, in case this work is rejected, since efforts at keeping a common base at that point don't make sense
the "rip the bandaid off" work happened in #516; the intent was to follow up with additional patches with the new functionality, but a combination of health issues and context switching to too many other projects resulted in shipping 0.25.0 without the real value-adds
WARNING: the below is a mega-patch with best effort at clarity, sheltie was initially meant to be its own thing but multiple people in the ecosystem encouraged me to fully upstream the work
the individual commits have been repackaged in https://github.com/parkan/sheltie/commits/audit-history/ (note: this may have some sed noise from sheltie/lassie renaming, should all be clean in this PR now) for legibility; applying them to this tree piecemeal is an ordeal due to the significant divergence
full changes summary will be in the comment below this one, tested on my own fork first