Releases · Lightning-AI/litData

20 Feb 10:42

dhedey

v0.2.61

83fae9a

LitData v0.2.61 Latest


Lightning AI ⚡ is excited to announce the release of LitData v0.2.61

Highlights

Fixes regression in 0.2.60 writing to Lightning Storage (#791)

What's Changed

chore(deps): bump mosaicml-streaming from 0.11.0 to 0.13.0 by @dependabot[bot] in #789
fix: Ensure uint64 fields are handled correctly in _create_dataset by @dhedey in #791
fix: Fixes various file/lock delete failures on windows to allow us to unpin lockfile by @dhedey in #792
chore: Bump to 0.2.61 for release by @dhedey in #793

New Contributors

@dhedey made their first contribution in #791

Full Changelog: v0.2.60...v0.2.61

Contributors

dhedey and dependabot


28 Jan 14:26

tchaton

v0.2.60

016caed

Weekly Release 0.2.60

What's Changed

fixed r2 refetch interval by @vlad-heidi in #777
Fix StreamingDataset len after drop_last update by @MagellaX in #778
chore(deps): update sphinx requirement from <7.0,>=6.0 to >=6.0,<9.0 by @dependabot[bot] in #763
chore(deps): bump pytest-rerunfailures from 16.0.1 to 16.1 by @dependabot[bot] in #764
chore(deps): bump the gha-updates group across 1 directory with 3 updates by @dependabot[bot] in #774
[pre-commit.ci] pre-commit suggestions by @pre-commit-ci[bot] in #779
fix: lint errors (UP007, UP045, UP006 & UP035) by @bhimrazy in #754
chore(deps): update coverage requirement from ==7.10.* to ==7.12.* by @dependabot[bot] in #762
chore: add & simplify concurrency setting to CI testing workflow by @bhimrazy in #780
Fix ParallelStreamingDataset with resume=True not resuming after loading a state dict when breaking early by @philgzl in #771
Bump SDK by @tchaton in #783
chore(deps): bump JamesIves/github-pages-deploy-action from 4.7.6 to 4.8.0 in the gha-updates group by @dependabot[bot] in #782
feat(litdata): Better support for filestore & co by @tchaton in #785
chore(litdata): Pre-release version bump 0.2.60 by @tchaton in #786

New Contributors

@vlad-heidi made their first contribution in #777

Full Changelog: v0.2.59...v0.2.60

Contributors

tchaton, dependabot, and 5 other contributors


13 Dec 00:24

pwgardipee

v0.2.59

5913181

LitData v0.2.59

Lightning AI ⚡ is excited to announce the release of LitData v0.2.59

Changes

Added

add CHANGELOG.md to track project updates by @deependujha in #733
feat: add support to disable external version checks by @sanggusti in #737
feat: Add Python 3.14 zstd builtin support by @bhimrazy in #749
feat: add align_chunking option to preserve deterministic chunk boundaries across workers by @deependujha in #768

Changed

pin: torchaudio to >=2.7.0,<2.9 by @deependujha in #738
ref(test): remove torchaudio dependency and update audio processing to just use soundfile by @bhimrazy in #739

Fixed

fix(ci): failing link checks by @bhimrazy in #748
fix: ZstdError handling for Python <3.14 & >=3.14 compatibility by @bhimrazy in #767
Fix ParallelStreamingDataset with resume=True not resuming after the second epoch when breaking early by @philgzl in #761
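
The Python 3.14 items above (#749, #767) relate to the `compression.zstd` module that joined the standard library in 3.14. A minimal sketch of a version-tolerant import is shown below. It is not LitData's actual code: the `load_zstd` helper is a hypothetical name, and falling back to the third-party `zstandard` package is an assumption made purely for illustration.

```python
import importlib


def load_zstd():
    """Return (compress, decompress, error_type), or None if no backend is installed."""
    # Python >= 3.14 ships Zstandard support in the standard library.
    try:
        mod = importlib.import_module("compression.zstd")
        return mod.compress, mod.decompress, mod.ZstdError
    except ImportError:
        pass
    # Assumed fallback for older interpreters: the third-party `zstandard` package.
    try:
        mod = importlib.import_module("zstandard")
        return mod.compress, mod.decompress, mod.ZstdError
    except ImportError:
        return None


backend = load_zstd()
if backend is not None:
    compress, decompress, zstd_error = backend
    payload = b"litdata" * 100
    # Round-trip check: decompressing the compressed bytes restores the input.
    assert decompress(compress(payload)) == payload
```

Returning the backend's own error type alongside the functions lets calling code write a single `except zstd_error:` clause regardless of which interpreter version is running.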

Chores

chore(deps): update transformers requirement from <4.53.0 to <4.57.0 by @dependabot[bot] in #723
chore(deps): bump lightning-sdk from 2025.8.1 to 2025.9.30 by @dependabot[bot] in #724
chore(deps): bump pytest-cov from 6.2.1 to 7.0.0 by @dependabot[bot] in #725
chore(deps): bump astral-sh/setup-uv from 6 to 7 in the gha-updates group by @dependabot[bot] in #735
chore(deps): update transformers requirement from <4.57.0 to <4.58.0 by @dependabot[bot] in #746
chore(deps): bump pytest-rerunfailures from 15.1 to 16.0.1 by @dependabot[bot] in #745
chore(deps): bump actions/download-artifact from 5 to 6 in the gha-updates group by @dependabot[bot] in #741
docs: add anchor links to feature sections in README for easy referencing by @VijayVignesh1 in #743
chore(ci): add Python 3.14 to the testing matrix by @bhimrazy in #747
chore: drop support for Python 3.9 (EOL) by @bhimrazy in #751
chore(deps): bump JamesIves/github-pages-deploy-action from 4.7.3 to 4.7.4 in the gha-updates group by @dependabot[bot] in #750

Full Changelog: v0.2.58...v0.2.59

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!

New Contributors

@sanggusti made their first contribution in #737
@VijayVignesh1 made their first contribution in #743

Thank you ❤️ and we hope you'll keep them coming!

Contributors

VijayVignesh1, sanggusti, and 4 other contributors


07 Oct 12:17

tchaton

v0.2.58

dad316e

Release 0.2.58

What's Changed

Fix: Hide debug statements behind _DEBUG by @robTheBuildr in #730
Chore: bump version to 0.2.58 by @robTheBuildr in #732

Full Changelog: v0.2.57...v0.2.58

Contributors

robTheBuildr


06 Oct 20:31

tchaton

v0.2.57

695a314

Release 0.2.57

What's Changed

Better support for streaming optimized dataset by @robTheBuildr in #727
Update CODEOWNERS to modify ownership assignments by @Borda in #726
Bump Version 0.2.57 by @tchaton in #729

Full Changelog: v0.2.56...v0.2.57

Contributors

Borda, robTheBuildr, and tchaton


23 Sep 02:49

tchaton

v0.2.56

df92bf8

v0.2.56

What's Changed

Fix(be): Avoid decompression race condition by @robTheBuildr in #718
Bump version 0.2.56 by @tchaton in #719

New Contributors

@robTheBuildr made their first contribution in #718

Full Changelog: v0.2.55...v0.2.56

Contributors

robTheBuildr and tchaton


19 Sep 15:47

pwgardipee

v0.2.55

f990376

LitData v0.2.55

Lightning AI ⚡ is excited to announce the release of LitData v0.2.55

Highlights

[Fixed] Writing compressed data to a lightning_storage folder

This release focuses on fixing errors when writing compressed output data to a lightning_storage folder. Previously, a code snippet like the following would break.

from litdata import StreamingDataset, StreamingDataLoader, optimize
import time

def should_keep(data):
    if data % 2 == 0:
        yield data


if __name__ == "__main__":
    output_dir = "/teamspace/lightning_storage/my-folder-1/output"
    optimize(
        fn=should_keep,
        inputs=list(range(500)),
        output_dir=output_dir,
        chunk_bytes="64MB",
        num_workers=4,
        compression="zstd", # Previously, this would cause an error
    )
    time.sleep(20)
    dataset = StreamingDataset(output_dir)
    dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)
    for _ in dataloader:
        # process code here
        pass

Changes

Fixed

Fix errors when using compression and r2 in optimize() by @pwgardipee in #715

Changed

Remove s5cmd from the R2 downloader by @pwgardipee in #714

Chores

chore(ci): Add step to minimize uv cache in CI workflow by @bhimrazy in #713

Full Changelog: v0.2.54...v0.2.55

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!

Key Contributors

@pwgardipee @bhimrazy

Thank you ❤️ and we hope you'll keep them coming!

Contributors

pwgardipee and bhimrazy


10 Sep 14:02

pwgardipee

v0.2.54

b50d428

LitData v0.2.54

Lightning AI ⚡ is excited to announce the release of LitData v0.2.54

Highlights

Lightning AI Storage - Direct download

Lightning Studios have special directories for data connections that are available to an entire teamspace. LitData functions that reference those directories run significantly faster because uploads and downloads go directly to and from the bucket that backs the folder. LitData already supported folder types such as S3 and GCS folders; this release adds support for the recently launched lightning_storage folders.

For example, data will be downloaded directly from the my-bucket-1 Lightning Storage bucket in this example code.

from litdata import StreamingDataset

if __name__ == "__main__":
    data_dir = "/teamspace/lightning_storage/my-bucket-1/data"

    dataset = StreamingDataset(data_dir)

    for sample in dataset:
        print(sample)

References to any of the following directories will work similarly:

/teamspace/lightning_storage/...
/teamspace/s3_connections/...
/teamspace/gcs_connections/...
/teamspace/s3_folders/...
/teamspace/gcs_folders/...
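
As a rough illustration of the list above, the helper below checks whether a path falls under one of those data connection directories. The function and constant names are hypothetical; LitData's own resolver is not shown in these notes and may work differently.

```python
from pathlib import PurePosixPath

# Directory prefixes copied from the list above.
DATA_CONNECTION_ROOTS = (
    "/teamspace/lightning_storage",
    "/teamspace/s3_connections",
    "/teamspace/gcs_connections",
    "/teamspace/s3_folders",
    "/teamspace/gcs_folders",
)


def is_data_connection_path(path: str) -> bool:
    """Return True if `path` lies strictly inside one of the data connection roots."""
    candidate = PurePosixPath(path)
    for root in DATA_CONNECTION_ROOTS:
        root_path = PurePosixPath(root)
        # Compare whole path components, so "/teamspace/s3_folders_old" does not match.
        if candidate != root_path and candidate.is_relative_to(root_path):
            return True
    return False


print(is_data_connection_path("/teamspace/lightning_storage/my-bucket-1/data"))
print(is_data_connection_path("/teamspace/studios/this_studio/data"))
```

Matching on path components rather than raw string prefixes avoids false positives for sibling directories whose names merely start with the same characters.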

Changes

Added

Add downloader for R2 by @pwgardipee in #711

Changed

Update README.md by @tchaton in #710

Full Changelog: v0.2.53...v0.2.54

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!

Key Contributors

@pwgardipee

Thank you ❤️ and we hope you'll keep them coming!

Contributors

tchaton and pwgardipee


09 Sep 14:42

pwgardipee

v0.2.53

8a8e651

LitData v0.2.53

Lightning AI ⚡ is excited to announce the release of LitData v0.2.53

Highlights

Lightning AI Storage - Direct download and upload

For example, output artifacts from this code will be directly uploaded to the my-data-1 Lightning Storage bucket.

from litdata import optimize

def should_keep(data):
    if data % 2 == 0:
        yield data

if __name__ == "__main__":
    optimize(
        fn=should_keep,
        inputs=list(range(1000)),
        output_dir="/teamspace/lightning_storage/my-data-1/output",
        chunk_bytes="64MB",
        num_workers=1
    )

Similarly, data will be downloaded directly from the my-bucket-1 Lightning Storage bucket in this example code.

from litdata import StreamingRawDataset

if __name__ == "__main__":
    data_dir = "/teamspace/lightning_storage/my-bucket-1/data"

    raw_dataset = StreamingRawDataset(data_dir)

    data = list(raw_dataset)
    print(data)

References to any of the following directories will work similarly:

/teamspace/lightning_storage/...
/teamspace/s3_connections/...
/teamspace/gcs_connections/...
/teamspace/s3_folders/...
/teamspace/gcs_folders/...

Changes

Added

Add support for resolving directories in /teamspace/lightning_storage by @bhimrazy in #695
Add support for direct upload to r2 buckets by @pwgardipee in #705
Add readme docs for references to data connection dirs by @pwgardipee in #708

Changed

Remove unnecessary fixed sleep by adding predicate-based path check by @Red-Eyed in #700
ref(resolver): Refactors data connection resolution by adding a helper function and eliminating code duplication. by @bhimrazy in #706
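
The first item above (#700) swaps a fixed sleep for a predicate-based check. The general pattern can be sketched as follows; `wait_until` and its parameters are illustrative names, not the function introduced by that PR.

```python
import os
import tempfile
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "index.json")
    open(target, "w").close()
    # Returns immediately because the file already exists.
    found = wait_until(lambda: os.path.exists(target))
    # Gives up after the short timeout because this path never appears.
    missing = wait_until(lambda: os.path.exists(target + ".missing"), timeout=0.2)

print(found, missing)
```

Compared with a fixed `time.sleep(...)`, polling returns as soon as the condition holds and still bounds the wait when it never does.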

Chores

chore(deps): bump actions/first-interaction from 2 to 3 in the gha-updates group by @dependabot[bot] in #693
chore(deps): update coverage requirement from ==7.8.* to ==7.10.* by @dependabot[bot] in #701
chore(deps): bump pytest-random-order from 1.1.1 to 1.2.0 by @dependabot[bot] in #703
chore(deps): bump cryptography from 45.0.4 to 45.0.7 by @dependabot[bot] in #704
chore(deps): bump the gha-updates group with 3 updates by @dependabot[bot] in #707

Full Changelog: v0.2.52...v0.2.53

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!

Key Contributors

@bhimrazy, @pwgardipee

New Contributors

@Red-Eyed made their first contribution in #700
@pwgardipee made their first contribution in #705

Thank you ❤️ and we hope you'll keep them coming!

Contributors

Red-Eyed, dependabot, and 2 other contributors


12 Aug 10:14

bhimrazy

v0.2.52

810ed23

LitData v0.2.52

Lightning AI ⚡ is excited to announce the release of LitData v0.2.52

Highlights

Grouping Support in StreamingRawDataset

StreamingRawDataset now supports flexible grouping of items during setup—ideal for pairing related files like images and masks.

from litdata import StreamingRawDataset
from litdata.raw import FileMetadata
from typing import Union

class CustomStreamingRawDataset(StreamingRawDataset):
    def setup(self, files: list[FileMetadata]) -> Union[list[FileMetadata], list[list[FileMetadata]]]:
        # Example: group files in pairs [[image_1, mask_1], ...]
        return files

dataset = CustomStreamingRawDataset("s3://bucket/files/")
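
The `setup` override above returns `files` unchanged. One way the pairing itself might look is sketched below on plain path strings, so that it runs without LitData. The `_mask` suffix convention and the helper name are assumptions; a real override would apply the same idea to `FileMetadata` objects.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def pair_images_with_masks(paths: list[str]) -> list[list[str]]:
    """Group paths as [image, mask] pairs, matching `name.ext` with `name_mask.ext`."""
    groups = defaultdict(dict)
    for path in paths:
        stem = PurePosixPath(path).stem
        if stem.endswith("_mask"):
            groups[stem[: -len("_mask")]]["mask"] = path
        else:
            groups[stem]["image"] = path
    # Keep only complete pairs, in sorted key order for reproducibility.
    return [
        [entry["image"], entry["mask"]]
        for _, entry in sorted(groups.items())
        if "image" in entry and "mask" in entry
    ]


pairs = pair_images_with_masks(
    ["data/cat.png", "data/cat_mask.png", "data/dog.png", "data/dog_mask.png", "data/orphan.png"]
)
print(pairs)
```

Files without a partner (here `data/orphan.png`) are dropped in this sketch; whether to drop, keep, or raise on unpaired files is a design choice for the dataset author.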

Remote Index Caching for Faster Startup

StreamingRawDataset now caches its file index both locally and remotely, speeding up initialization for large cloud datasets. It loads from local cache first, then tries remote cache, and rebuilds only if needed. Use recompute_index=True to force rebuild.

from litdata import StreamingRawDataset

dataset = StreamingRawDataset("s3://bucket/files/")  # Loads cached index if available
dataset = StreamingRawDataset("s3://bucket/files/", recompute_index=True)  # Force rebuild
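
The lookup order described above (local cache, then remote cache, then rebuild) can be sketched generically. Everything in this block is hypothetical: the `load_index` function and its callable parameters are illustrative and do not reflect LitData internals.

```python
def load_index(load_local, load_remote, rebuild, recompute_index=False):
    """Return (index, source). Each loader returns an index, or None on a cache miss."""
    if not recompute_index:
        for source, loader in (("local", load_local), ("remote", load_remote)):
            index = loader()
            if index is not None:
                return index, source
    # Either both caches missed or a rebuild was explicitly requested.
    return rebuild(), "rebuilt"


def local_miss():
    return None


def remote_hit():
    return ["a.jpg", "b.jpg"]


def scan_bucket():
    return ["a.jpg", "b.jpg", "c.jpg"]


print(load_index(local_miss, remote_hit, scan_bucket))
print(load_index(local_miss, remote_hit, scan_bucket, recompute_index=True))
```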

Shuffle Control Added to `train_test_split`

Splitting your streaming datasets is now more flexible with the new shuffle parameter. You can choose whether to shuffle your dataset before splitting, giving you better control over how your training, testing, and validation sets are created.

from litdata import train_test_split

train_ds, test_ds = train_test_split(streaming_dataset, splits=[0.8, 0.2], shuffle=True)
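
What shuffling before splitting means can be illustrated on plain indices. This sketch is independent of LitData: `split_indices` and its `seed` argument are hypothetical, and the library's own `train_test_split` operates on streaming datasets and may partition items differently.

```python
import random


def split_indices(n, splits, shuffle=False, seed=42):
    """Partition range(n) into consecutive slices sized by the given fractions."""
    indices = list(range(n))
    if shuffle:
        # Shuffle once up front so each split draws from the whole dataset.
        random.Random(seed).shuffle(indices)
    result, start = [], 0
    for fraction in splits:
        size = int(n * fraction)
        result.append(indices[start:start + size])
        start += size
    return result


ordered_train, ordered_test = split_indices(10, [0.8, 0.2], shuffle=False)
shuffled_train, shuffled_test = split_indices(10, [0.8, 0.2], shuffle=True)
print(ordered_train, ordered_test)
print(shuffled_train, shuffled_test)
```

Without shuffling, the test split is always the tail of the dataset, which can bias evaluation when the data is ordered (for example by class or by date).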

Changes

Added

Added grouping functionality to StreamingRawDataset allowing flexible item structuring in setup method (#665 by @bhimrazy)
Added shuffle parameter to train_test_split (#675 by @otogamer)
Added CI workflow to check for broken links (#676 by @Vimal-Shady)
Added remote and local index caching in StreamingRawDataset to speed up dataset initialization with multi-level cache system (#666 by @bhimrazy)

Changed

Removed asyncio from requirements.txt since it’s included in the Python standard library (#670 by @deependujha)
Moved raw dataset code to litdata/raw, expose StreamingRawDataset at top-level (#671 by @bhimrazy)
Updated README with stories for raw vs optimized streaming option (#677 by @bhimrazy)

Fixed

Fixed broken 'Get Started' link in README (#674 by @Vimal-Shady)
Fixed and enabled parallel test execution with pytest-xdist in CI workflow (#620 by @deependujha)
Clean up leftover chunk lock files by prefix during Reader delete operation (#683 by @jwills)
Ensure all tests run correctly with ignore pattern fix (#679 by @Borda)

Chores

Bumped lightning-sdk from 0.1.46 to 2025.8.1 (#668 by @dependabot[bot])
Bumped pytest-rerunfailures from 14.0 to 15.1 (#667 by @dependabot[bot])
Bumped pytest-cov from 6.1.1 to 6.2.1 (#669 by @dependabot[bot])
Bumped the gha-updates group with 2 updates (#690 by @dependabot[bot])
Bumped litdata version to 0.2.52 (#691 by @bhimrazy)

Full Changelog: v0.2.51...v0.2.52

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!

Key Contributors

@deependujha, @Borda, @bhimrazy

New Contributors

@Vimal-Shady made their first contribution in #674
@otogamer made their first contribution in #675
@jwills made their first contribution in #683

Thank you ❤️ and we hope you'll keep them coming!

Contributors

jwills, Borda, and 5 other contributors
