This project contains a Python script to download all subtitles from azsubtitles.com.
This project uses uv for package and environment management.
- Python 3.8+
- uv
Create a virtual environment and install dependencies:
    uv venv
    uv pip install -r requirements.txt
Alternatively, if you have just pyproject.toml:
    uv venv
    uv pip install .
Activate the virtual environment:
    source .venv/bin/activate
(On Windows, use .venv\Scripts\activate)
Once the environment is activated, run the script:
    python scrape.py
The script will create a language_subtitles directory and download all found .srt files into it.
Hugging Face Dataset
    from datasets import load_dataset
    ds = load_dataset("pradeepannepu/telugu_subtitle")
Pandas
    import pandas as pd
    df = pd.read_csv("hf://datasets/pradeepannepu/telugu_subtitle/subtitles.csv")
Croissant
    from mlcroissant import Dataset
    ds = Dataset(jsonld="https://huggingface.co/api/datasets/pradeepannepu/telugu_subtitle/croissant")
    records = ds.records("default")
Subtitle metadata and files are obtained programmatically from the public API endpoints of AZSubtitles: https://www.azsubtitles.com
Full credit: AZSubtitles (AZSubtitles.com). This dataset derivation depends entirely on their publicly exposed API responses.
Collector: scrape.py
Process:
- Enumerates all movies with Telugu subtitles via /api/search?lg=telugu
- Retrieves per-movie details via /api/movie/{UID}
- Filters for Telugu subtitles (Language.Title == "Telugu")
- Downloads the first available file URL for each matching subtitle
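The filtering and "first available file" steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of scrape.py: the endpoint paths and the Language.Title check come from the list above, while the JSON field names `Subtitles`, `Files` and `Url` and the helper names `fetch_json` and `telugu_file_urls` are hypothetical and would need to be checked against real API responses.

```python
import json
import urllib.request

BASE_URL = "https://www.azsubtitles.com"


def fetch_json(path):
    """GET a JSON document from the API, e.g. path='/api/search?lg=telugu'."""
    with urllib.request.urlopen(BASE_URL + path, timeout=30) as resp:
        return json.load(resp)


def telugu_file_urls(movie):
    """Return the first file URL of each Telugu subtitle in one movie record.

    Assumed (hypothetical) record shape:
    {"Subtitles": [{"Language": {"Title": ...}, "Files": [{"Url": ...}]}]}
    """
    urls = []
    for sub in movie.get("Subtitles", []):
        # Keep only entries whose Language.Title is exactly "Telugu".
        if sub.get("Language", {}).get("Title") != "Telugu":
            continue
        files = sub.get("Files") or []
        if files:
            # Only the first available file is taken per subtitle.
            urls.append(files[0]["Url"])
    return urls
```

A per-movie record would be obtained with something like `fetch_json(f"/api/movie/{uid}")` and passed to `telugu_file_urls`. Keeping the filter separate from the network call makes it easy to test against saved responses.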
telugu_subtitles/
Contains raw downloaded subtitle archive files as returned by AZSubtitles. Filenames are sanitized (alphanumeric, space, dot, underscore, hyphen).
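One way to implement the filename sanitization described above is sketched below. The allowed character set is taken from the description. Replacing disallowed characters with an underscore (rather than dropping them), restricting "alphanumeric" to ASCII, trimming leading and trailing spaces and dots, and the `unnamed` fallback are all assumptions, and `sanitize_filename` is a hypothetical name.

```python
import re

# Anything that is not an ASCII letter, digit, space, dot, underscore or hyphen.
_DISALLOWED = re.compile(r"[^A-Za-z0-9 ._-]")


def sanitize_filename(name):
    """Replace disallowed characters with '_' and trim edge spaces and dots."""
    cleaned = _DISALLOWED.sub("_", name).strip(" .")
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or "unnamed"
```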
This dataset is intended for:
- NLP experiments (tokenization, language modeling)
- Subtitle format parsing research
- Temporal alignment exploration
Do not redistribute subtitle contents without verifying permission.
If you use these subtitles or derived data:
- Cite AZSubtitles as the original source.
- Provide a link: https://www.azsubtitles.com
- Indicate that acquisition used their public API.
Suggested citation snippet: Data sourced from AZSubtitles (https://www.azsubtitles.com) via automated retrieval of publicly available Telugu subtitle listings.
Legal and ethical considerations:
- Review AZSubtitles Terms of Service before large-scale use or redistribution.
- Subtitles may be copyright works; confirm rights for downstream applications.
- Remove any file upon request from rights holders.
- This repository does not claim ownership of subtitle content.
Known limitations:
- Some files may be compressed archives (.zip/.rar).
- Encoding may vary (UTF-8, Windows-1252, etc.).
- Duplicate or near-duplicate subtitles possible.
- No automatic language validation beyond API metadata.
Recommended preprocessing:
- Decompress archives uniformly.
- Normalize encoding to UTF-8.
- Strip timestamps and formatting for pure text corpora.
- Deduplicate by hash of cleaned content.
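The encoding, stripping and deduplication steps above could look roughly like this for .srt content that has already been extracted from its archive. It is a sketch using only the standard library. The encoding fallback order, the regular expressions and the function names are assumptions, and archive extraction is left out (the standard `zipfile` module covers .zip, while .rar needs a third-party tool).

```python
import hashlib
import re

# SRT cue timing line, e.g. "00:01:02,345 --> 00:01:04,000".
_TIMING = re.compile(
    r"\d{1,2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{1,2}:\d{2}:\d{2}[,.]\d{3}"
)
# Inline markup such as <i>...</i> or {\an8}.
_TAGS = re.compile(r"<[^>]+>|\{\\[^}]*\}")


def decode_bytes(raw, encodings=("utf-8-sig", "cp1252")):
    """Decode subtitle bytes, trying each encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")  # never raises; last resort


def srt_to_text(srt):
    """Drop cue numbers, timing lines and inline markup; keep dialogue lines.

    Caveat: a dialogue line consisting only of digits is also dropped.
    """
    lines = []
    for line in srt.splitlines():
        line = _TAGS.sub("", line).strip()
        if not line or line.isdigit() or _TIMING.search(line):
            continue
        lines.append(line)
    return "\n".join(lines)


def content_hash(text):
    """SHA-256 of whitespace-normalized text, usable as a deduplication key."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Files whose cleaned text produces the same `content_hash` can be treated as exact duplicates. Near-duplicates (for example, the same dialogue with shifted timings and small wording changes) need a fuzzier comparison than a hash.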
Use data respectfully. Avoid generating or distributing infringing derivative works.
Issues: open a repository issue describing the file and the concern. Removal requests are honored.