Releases: Lightning-AI/litData
LitData v0.2.61
Lightning AI ⚡ is excited to announce the release of LitData v0.2.61
Highlights
Fixes regression in 0.2.60 writing to Lightning Storage (#791)
What's Changed
- chore(deps): bump mosaicml-streaming from 0.11.0 to 0.13.0 by @dependabot[bot] in #789
- fix: Ensure uint64 fields are handled correctly in `_create_dataset` by @dhedey in #791
- fix: Fixes various file/lock delete failures on Windows to allow us to unpin lockfile by @dhedey in #792
- chore: Bump to 0.2.61 for release by @dhedey in #793
New Contributors
Full Changelog: v0.2.60...v0.2.61
Weekly Release 0.2.60
What's Changed
- fixed r2 refetch interval by @vlad-heidi in #777
- Fix StreamingDataset len after drop_last update by @MagellaX in #778
- chore(deps): update sphinx requirement from <7.0,>=6.0 to >=6.0,<9.0 by @dependabot[bot] in #763
- chore(deps): bump pytest-rerunfailures from 16.0.1 to 16.1 by @dependabot[bot] in #764
- chore(deps): bump the gha-updates group across 1 directory with 3 updates by @dependabot[bot] in #774
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci[bot] in #779
- fix: lint errors (UP007, UP045, UP006 & UP035) by @bhimrazy in #754
- chore(deps): update coverage requirement from ==7.10.* to ==7.12.* by @dependabot[bot] in #762
- chore: add & simplify concurrency setting to CI testing workflow by @bhimrazy in #780
- Fix `ParallelStreamingDataset` with `resume=True` not resuming after loading a state dict when breaking early by @philgzl in #771
- Bump SDK by @tchaton in #783
- chore(deps): bump JamesIves/github-pages-deploy-action from 4.7.6 to 4.8.0 in the gha-updates group by @dependabot[bot] in #782
- feat(litdata): Better support for filestore & co by @tchaton in #785
- chore(litdata): Pre-release version bump 0.2.60 by @tchaton in #786
New Contributors
- @vlad-heidi made their first contribution in #777
Full Changelog: v0.2.59...v0.2.60
LitData v0.2.59
Lightning AI ⚡ is excited to announce the release of LitData v0.2.59
Changes
Added
- add `CHANGELOG.md` to track project updates by @deependujha in #733
- feat: add support to disable external version checks by @sanggusti in #737
- feat: Add Python 3.14 zstd builtin support by @bhimrazy in #749
- feat: add `align_chunking` option to preserve deterministic chunk boundaries across workers by @deependujha in #768
Changed
- pin: torchaudio to `>=2.7.0,<2.9` by @deependujha in #738
- ref(test): remove torchaudio dependency and update audio processing to just use soundfile by @bhimrazy in #739
Fixed
Chores
- chore(deps): update transformers requirement from <4.53.0 to <4.57.0 by @dependabot[bot] in #723
- chore(deps): bump lightning-sdk from 2025.8.1 to 2025.9.30 by @dependabot[bot] in #724
- chore(deps): bump pytest-cov from 6.2.1 to 7.0.0 by @dependabot[bot] in #725
- chore(deps): bump astral-sh/setup-uv from 6 to 7 in the gha-updates group by @dependabot[bot] in #735
- chore(deps): update transformers requirement from <4.57.0 to <4.58.0 by @dependabot[bot] in #746
- chore(deps): bump pytest-rerunfailures from 15.1 to 16.0.1 by @dependabot[bot] in #745
- chore(deps): bump actions/download-artifact from 5 to 6 in the gha-updates group by @dependabot[bot] in #741
- docs: add anchor links to feature sections in README for easy referencing by @VijayVignesh1 in #743
- chore(ci): add Python 3.14 to the testing matrix by @bhimrazy in #747
- chore: drop support for Python 3.9 (EOL) by @bhimrazy in #751
- chore(deps): bump JamesIves/github-pages-deploy-action from 4.7.3 to 4.7.4 in the gha-updates group by @dependabot[bot] in #750
Full Changelog: v0.2.58...v0.2.59
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!
New Contributors
- @sanggusti made their first contribution in #737
- @VijayVignesh1 made their first contribution in #743
Thank you ❤️ and we hope you'll keep them coming!
Release 0.2.58
What's Changed
- Fix: Hide debug statements behind _DEBUG by @robTheBuildr in #730
- Chore: bump version to 0.2.58 by @robTheBuildr in #732
Full Changelog: v0.2.57...v0.2.58
Release 0.2.57
What's Changed
- Better support for streaming optimized dataset by @robTheBuildr in #727
- Update CODEOWNERS to modify ownership assignments by @Borda in #726
- Bump Version 0.2.57 by @tchaton in #729
Full Changelog: v0.2.56...v0.2.57
v0.2.56
What's Changed
- Fix(be): Avoid decompression race condition by @robTheBuildr in #718
- Bump version 0.2.56 by @tchaton in #719
New Contributors
- @robTheBuildr made their first contribution in #718
Full Changelog: v0.2.55...v0.2.56
LitData v0.2.55
Lightning AI ⚡ is excited to announce the release of LitData v0.2.55
Highlights
[Fixed] Writing compressed data to a lightning_storage folder
This release focuses on fixing errors when writing compressed output data to a lightning_storage folder. Previously, a code snippet like the following would break.

    import time

    from litdata import StreamingDataset, StreamingDataLoader, optimize


    def should_keep(data):
        if data % 2 == 0:
            yield data


    if __name__ == "__main__":
        output_dir = "/teamspace/lightning_storage/my-folder-1/output"
        optimize(
            fn=should_keep,
            inputs=list(range(500)),
            output_dir=output_dir,
            chunk_bytes="64MB",
            num_workers=4,
            compression="zstd",  # Previously, this would cause an error
        )

        time.sleep(20)

        dataset = StreamingDataset(output_dir)
        dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)
        for _ in dataloader:
            # process code here
            pass

Changes
Fixed
- Fix errors when using compression and r2 in optimize() by @pwgardipee in #715
Changed
- Remove s5cmd from the R2 downloader by @pwgardipee in #714
Full Changelog: v0.2.54...v0.2.55
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!
Key Contributors
Thank you ❤️ and we hope you'll keep them coming!
LitData v0.2.54
Lightning AI ⚡ is excited to announce the release of LitData v0.2.54
Highlights
Lightning AI Storage - Direct download
Lightning Studios have special directories for data connections that are available to an entire teamspace. LitData functions that reference those directories will experience a significant performance increase as uploads and downloads will happen directly from the bucket that backs the folder. LitData has supported existing folder types like S3 and GCS folders, and this release introduces support for lightning_storage folders which were recently launched.
For example, the code below downloads data directly from the my-bucket-1 Lightning Storage bucket.

    from litdata import StreamingDataset

    if __name__ == "__main__":
        data_dir = "/teamspace/lightning_storage/my-bucket-1/data"
        dataset = StreamingDataset(data_dir)
        for sample in dataset:
            print(sample)

References to any of the following directories will work similarly:
- /teamspace/lightning_storage/...
- /teamspace/s3_connections/...
- /teamspace/gcs_connections/...
- /teamspace/s3_folders/...
- /teamspace/gcs_folders/...
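As a rough illustration of the directory list above, the following sketch checks whether a path falls under one of those roots. The helper name `is_data_connection_path` and the constant `DATA_CONNECTION_ROOTS` are hypothetical and not part of LitData; how LitData itself resolves these directories is not shown here.

```python
from pathlib import PurePosixPath

# Directory roots taken from the list in the release notes.
DATA_CONNECTION_ROOTS = (
    "/teamspace/lightning_storage",
    "/teamspace/s3_connections",
    "/teamspace/gcs_connections",
    "/teamspace/s3_folders",
    "/teamspace/gcs_folders",
)


def is_data_connection_path(path: str) -> bool:
    """Hypothetical helper: True if `path` lies inside one of the roots."""
    parts = PurePosixPath(path).parts
    for root in DATA_CONNECTION_ROOTS:
        root_parts = PurePosixPath(root).parts
        # Require at least one component below the root (the connection name),
        # and compare whole components so "s3_folders_extra" does not match.
        if len(parts) > len(root_parts) and parts[: len(root_parts)] == root_parts:
            return True
    return False


if __name__ == "__main__":
    print(is_data_connection_path("/teamspace/lightning_storage/my-bucket-1/data"))
```

Comparing path components rather than raw string prefixes avoids false matches on sibling directories whose names merely start with one of the roots.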
Changes
Added
- Add downloader for R2 by @pwgardipee in #711
Full Changelog: v0.2.53...v0.2.54
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!
Key Contributors
Thank you ❤️ and we hope you'll keep them coming!
LitData v0.2.53
Lightning AI ⚡ is excited to announce the release of LitData v0.2.53
Highlights
Lightning AI Storage - Direct download and upload
Lightning Studios have special directories for data connections that are available to an entire teamspace. LitData functions that reference those directories will experience a significant performance increase as uploads and downloads will happen directly from the bucket that backs the folder. LitData has supported existing folder types like S3 and GCS folders, and this release introduces support for lightning_storage folders which were recently launched.
For example, output artifacts from this code will be directly uploaded to the my-data-1 Lightning Storage bucket.

    from litdata import optimize


    def should_keep(data):
        if data % 2 == 0:
            yield data


    if __name__ == "__main__":
        optimize(
            fn=should_keep,
            inputs=list(range(1000)),
            output_dir="/teamspace/lightning_storage/my-data-1/output",
            chunk_bytes="64MB",
            num_workers=1,
        )

Similarly, the code below downloads data directly from the my-bucket-1 Lightning Storage bucket.

    from litdata import StreamingRawDataset

    if __name__ == "__main__":
        data_dir = "/teamspace/lightning_storage/my-bucket-1/data"
        raw_dataset = StreamingRawDataset(data_dir)
        data = list(raw_dataset)
        print(data)

References to any of the following directories will work similarly:
- /teamspace/lightning_storage/...
- /teamspace/s3_connections/...
- /teamspace/gcs_connections/...
- /teamspace/s3_folders/...
- /teamspace/gcs_folders/...
Changes
Added
- Add support for resolving directories in `/teamspace/lightning_storage` by @bhimrazy in #695
- Add support for direct upload to r2 buckets by @pwgardipee in #705
- Add readme docs for references to data connection dirs by @pwgardipee in #708
Changed
Chores
- chore(deps): bump actions/first-interaction from 2 to 3 in the gha-updates group by @dependabot[bot] in #693
- chore(deps): update coverage requirement from ==7.8.* to ==7.10.* by @dependabot[bot] in #701
- chore(deps): bump pytest-random-order from 1.1.1 to 1.2.0 by @dependabot[bot] in #703
- chore(deps): bump cryptography from 45.0.4 to 45.0.7 by @dependabot[bot] in #704
- chore(deps): bump the gha-updates group with 3 updates by @dependabot[bot] in #707
Full Changelog: v0.2.52...v0.2.53
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!
Key Contributors
New Contributors
- @Red-Eyed made their first contribution in #700
- @pwgardipee made their first contribution in #705
Thank you ❤️ and we hope you'll keep them coming!
LitData v0.2.52
Lightning AI ⚡ is excited to announce the release of LitData v0.2.52
Highlights
Grouping Support in StreamingRawDataset
StreamingRawDataset now supports flexible grouping of items during setup, which is ideal for pairing related files like images and masks.
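The pairing logic that a `setup` override might apply can be sketched in isolation. This is a minimal sketch under stated assumptions, not LitData's implementation: `FileEntry` is a stand-in for the real file metadata type, it assumes each record exposes a `path` string, and the `_mask` filename suffix is an invented naming convention for the illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileEntry:
    """Stand-in for a file record; only a `path` attribute is assumed."""

    path: str


def pair_images_and_masks(files: list[FileEntry]) -> list[list[FileEntry]]:
    """Group files as [[image, mask], ...] by shared stem (001.jpg + 001_mask.png)."""
    groups: dict[str, dict[str, FileEntry]] = defaultdict(dict)
    for entry in files:
        stem = PurePosixPath(entry.path).stem
        if stem.endswith("_mask"):
            # Assumed convention: masks are named "<image stem>_mask.<ext>".
            groups[stem[: -len("_mask")]]["mask"] = entry
        else:
            groups[stem]["image"] = entry
    # Keep only complete pairs, ordered by stem for determinism.
    return [
        [group["image"], group["mask"]]
        for _, group in sorted(groups.items())
        if "image" in group and "mask" in group
    ]
```

Files without a partner are dropped here; a real dataset might instead raise an error or keep them as single-item groups.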

    from typing import Union

    from litdata import StreamingRawDataset
    from litdata.raw import FileMetadata


    class CustomStreamingRawDataset(StreamingRawDataset):
        def setup(self, files: list[FileMetadata]) -> Union[list[FileMetadata], list[list[FileMetadata]]]:
            # Example: group files in pairs [[image_1, mask_1], ...]
            return files


    dataset = CustomStreamingRawDataset("s3://bucket/files/")

Remote Index Caching for Faster Startup
StreamingRawDataset now caches its file index both locally and remotely, speeding up initialization for large cloud datasets. It loads from local cache first, then tries remote cache, and rebuilds only if needed. Use recompute_index=True to force rebuild.
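The lookup order described above (local cache, then remote cache, then rebuild) can be sketched generically. This is only an illustration of the multi-level pattern, not LitData's code: `load_index` and its three callables are hypothetical names, and writing a rebuilt index back to the caches is left out.

```python
from typing import Callable, Optional


def load_index(
    load_local: Callable[[], Optional[list[str]]],
    load_remote: Callable[[], Optional[list[str]]],
    rebuild: Callable[[], list[str]],
    recompute_index: bool = False,
) -> tuple[list[str], str]:
    """Return (index, source), trying local -> remote -> rebuild in that order."""
    if not recompute_index:
        for source, loader in (("local", load_local), ("remote", load_remote)):
            index = loader()  # A loader returns None on a cache miss.
            if index is not None:
                return index, source
    # Both cache levels missed, or a rebuild was forced.
    return rebuild(), "rebuilt"
```

Passing `recompute_index=True` skips both cache levels, mirroring the forced rebuild mentioned in the notes.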

    from litdata import StreamingRawDataset

    dataset = StreamingRawDataset("s3://bucket/files/")  # Loads cached index if available
    dataset = StreamingRawDataset("s3://bucket/files/", recompute_index=True)  # Force rebuild

Shuffle Control Added to train_test_split
Splitting your streaming datasets is now more flexible with the new shuffle parameter. You can choose whether to shuffle your dataset before splitting, giving you better control over how your training, testing, and validation sets are created.
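To make the effect of shuffling before splitting concrete, here is a small sketch that works on plain integer indices. It does not reflect how LitData implements `train_test_split`; the function `split_indices` and its `seed` parameter are assumptions made for the illustration.

```python
import random


def split_indices(
    num_items: int,
    splits: list[float],
    shuffle: bool = False,
    seed: int = 42,
) -> list[list[int]]:
    """Split range(num_items) into consecutive parts sized by the fractions in `splits`."""
    if any(fraction < 0 for fraction in splits) or sum(splits) > 1 + 1e-9:
        raise ValueError("splits must be non-negative and sum to at most 1")
    indices = list(range(num_items))
    if shuffle:
        # Shuffle once, up front, so every part draws from the whole dataset.
        random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for fraction in splits:
        size = int(num_items * fraction)
        parts.append(indices[start : start + size])
        start += size
    return parts


if __name__ == "__main__":
    train, test = split_indices(10, [0.8, 0.2], shuffle=True)
    print(train, test)
```

Without shuffling, the parts are contiguous slices in the original order, which can skew a split when the data is sorted (for example, by class or by date).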

    from litdata import train_test_split

    train_ds, test_ds = train_test_split(streaming_dataset, splits=[0.8, 0.2], shuffle=True)

Changes
Added
- Added grouping functionality to `StreamingRawDataset` allowing flexible item structuring in `setup` method (#665 by @bhimrazy)
- Added shuffle parameter to `train_test_split` (#675 by @otogamer)
- Added CI workflow to check for broken links (#676 by @Vimal-Shady)
- Added remote and local index caching in `StreamingRawDataset` to speed up dataset initialization with multi-level cache system (#666 by @bhimrazy)
Changed
Fixed
- Fixed broken 'Get Started' link in README (#674 by @Vimal-Shady)
- Fixed and enabled parallel test execution with pytest-xdist in CI workflow (#620 by @deependujha)
- Clean up leftover chunk lock files by prefix during Reader delete operation (#683 by @jwills)
- Ensure all tests run correctly with ignore pattern fix (#679 by @Borda)
Chores
- Bumped lightning-sdk from 0.1.46 to 2025.8.1 (#668 by @dependabot[bot])
- Bumped pytest-rerunfailures from 14.0 to 15.1 (#667 by @dependabot[bot])
- Bumped pytest-cov from 6.1.1 to 6.2.1 (#669 by @dependabot[bot])
- Bumped the gha-updates group with 2 updates (#690 by @dependabot[bot])
- Bumped `litdata` version to 0.2.52 (#691 by @bhimrazy)
Full Changelog: v0.2.51...v0.2.52
🧑💻 Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!
Key Contributors
@deependujha, @Borda, @bhimrazy
New Contributors
- @Vimal-Shady made their first contribution in #674
- @otogamer made their first contribution in #675
- @jwills made their first contribution in #683
Thank you ❤️ and we hope you'll keep them coming!