Telugu Subtitle Scraper

This project contains a Python script that downloads Telugu subtitles from azsubtitles.com.

Setup and Usage

This project uses uv for package and environment management.

Prerequisites

Python 3.8+
uv

Installation

Create a virtual environment and install dependencies:
```
uv venv
uv pip install -r requirements.txt
```
Alternatively, if the project provides only a pyproject.toml:
```
uv venv
uv pip install .
```
Activate the virtual environment:
```
source .venv/bin/activate
```
(On Windows, use .venv\Scripts\activate)

Running the Scraper

Once the environment is activated, run the script:

```
python scrape.py
```

The script will create a language_subtitles directory and download all found .srt files into it.

Telugu Subtitle Dataset

Download Dataset

link

Hugging Face Dataset

```
from datasets import load_dataset

ds = load_dataset("pradeepannepu/telugu_subtitle")
```

Pandas

```
import pandas as pd

df = pd.read_csv("hf://datasets/pradeepannepu/telugu_subtitle/subtitles.csv")
```

Croissant

```
from mlcroissant import Dataset

ds = Dataset(jsonld="https://huggingface.co/api/datasets/pradeepannepu/telugu_subtitle/croissant")
records = ds.records("default")
```

Source

Subtitle metadata and files are obtained programmatically from the public API endpoints of AZSubtitles: https://www.azsubtitles.com
Full credit: AZSubtitles (AZSubtitles.com). This dataset derivation depends entirely on their publicly exposed API responses.

Script

Collector: scrape.py
Process:

Enumerates all movies with Telugu subtitles via /api/search?lg=telugu
Retrieves per-movie details via /api/movie/{UID}
Filters for Telugu subtitles (Language.Title == "Telugu")
Downloads the first available file URL for each matching subtitle
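
The steps above can be sketched roughly as follows. This is an illustrative outline, not the contents of scrape.py: only the two endpoint paths and the Language.Title check come from this README. The response field names ("UID" on search hits, "Subtitles", "Files", "Url") and the fetch_json helper are assumptions made for the example, and pagination is not handled.

```python
from typing import Any, Callable, Dict, Iterator, List

BASE_URL = "https://www.azsubtitles.com"


def telugu_file_urls(movie: Dict[str, Any]) -> List[str]:
    """Return the first file URL of each Telugu subtitle in one movie record.

    Assumed (hypothetical) shape: movie["Subtitles"] is a list of dicts, each
    holding a "Language" dict with a "Title" and a "Files" list with "Url".
    """
    urls = []
    for sub in movie.get("Subtitles", []):
        # Keep only entries whose Language.Title is exactly "Telugu".
        if sub.get("Language", {}).get("Title") != "Telugu":
            continue
        files = sub.get("Files") or []
        if files:
            # Only the first available file is taken for each subtitle.
            urls.append(files[0]["Url"])
    return urls


def collect(fetch_json: Callable[[str], Any]) -> Iterator[str]:
    """Enumerate movies, fetch each movie's details, and yield download URLs.

    fetch_json is a caller-supplied function mapping a URL to parsed JSON,
    which keeps this sketch independent of any particular HTTP library.
    """
    for hit in fetch_json(f"{BASE_URL}/api/search?lg=telugu"):
        detail = fetch_json(f"{BASE_URL}/api/movie/{hit['UID']}")
        yield from telugu_file_urls(detail)
```

Passing the fetch function in as a parameter makes it easy to try the filtering logic against saved sample responses before touching the live API.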

Intended Use

This dataset is intended for:

NLP experiments (tokenization, language modeling)
Subtitle format parsing research
Temporal alignment exploration

Do not redistribute subtitle contents without verifying permission.

Attribution Requirements

If you use these subtitles or derived data:

Cite AZSubtitles as the original source.
Provide a link: https://www.azsubtitles.com
Indicate that acquisition used their public API.

Suggested citation snippet: Data sourced from AZSubtitles (https://www.azsubtitles.com) via automated retrieval of publicly available Telugu subtitle listings.

Legal / Ethical Notice

Review AZSubtitles Terms of Service before large-scale use or redistribution.
Subtitles may be copyrighted works; confirm rights for downstream applications.
Remove any file upon request from rights holders.
This repository does not claim ownership of subtitle content.

Quality / Caveats

Some files may be compressed archives (.zip/.rar).
Encoding may vary (UTF-8, Windows-1252, etc.).
Duplicate or near-duplicate subtitles are possible.
No automatic language validation beyond API metadata.
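
As a rough illustration of handling the first two caveats, the sketch below classifies a downloaded file by its leading magic bytes and decodes text with a fallback chain. The function names are hypothetical and not part of this repository; the encoding order (UTF-8 first, then Windows-1252) is an assumption based on the encodings listed above.

```python
def sniff_archive(data: bytes) -> str:
    """Classify raw bytes as 'zip', 'rar', or 'plain' from leading magic bytes."""
    if data.startswith(b"PK\x03\x04"):
        return "zip"
    # This prefix is shared by the RAR 4 and RAR 5 signatures.
    if data.startswith(b"Rar!\x1a\x07"):
        return "rar"
    return "plain"


def decode_subtitle(data: bytes) -> str:
    """Decode subtitle bytes, trying UTF-8 (BOM tolerated) before Windows-1252."""
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so this last resort never raises.
    return data.decode("latin-1")
```

A file that only decodes through one of the fallbacks is unlikely to contain Telugu script, so it may be worth flagging such files for manual review.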

Preprocessing Suggestions

Decompress archives uniformly.
Normalize encoding to UTF-8.
Strip timestamps and formatting for pure text corpora.
Deduplicate by hash of cleaned content.
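
The last two suggestions could look something like the sketch below, assuming .srt input that has already been decoded to a string. The helper names and regular expressions are illustrative choices, not code from this repository.

```python
import hashlib
import re

# SRT timing line, e.g. "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
# HTML-like tags such as <i> and brace overrides such as {\an8}.
TAG = re.compile(r"</?[a-zA-Z][^>]*>|\{\\[^}]*\}")


def srt_to_text(srt: str) -> str:
    """Drop cue numbers, timing lines, and formatting tags; keep dialogue lines."""
    lines = []
    for raw in srt.splitlines():
        line = raw.strip()
        # Caveat: a dialogue line made only of digits is dropped as a cue number.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        line = TAG.sub("", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


def dedupe(texts):
    """Keep the first occurrence of each text, compared by SHA-256 of its content."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing the cleaned text rather than the raw file means two copies that differ only in timing or markup collapse to one entry. Near-duplicates with different wording are not caught by this approach.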

Responsible Use

Use data respectfully. Avoid generating or distributing infringing derivative works.

Contact

Issues: open a repository issue describing the file and the concern. Removal requests will be honored.
