pradeepannepu/subtitles


Telugu Subtitle Scraper

This project contains a Python script that downloads all Telugu subtitles listed on azsubtitles.com.

Setup and Usage

This project uses uv for package and environment management.

Prerequisites

  • Python 3.8+
  • uv

Installation

  1. Create a virtual environment and install dependencies:

    uv venv
    uv pip install -r requirements.txt

    Alternatively, if the project only has a pyproject.toml:

    uv venv
    uv pip install .
  2. Activate the virtual environment:

    source .venv/bin/activate

    (On Windows, use .venv\Scripts\activate)

Running the Scraper

Once the environment is activated, run the script:

python scrape.py

The script creates a telugu_subtitles directory and downloads every subtitle file it finds into it. Most files are expected to be .srt, but some may arrive as compressed archives.

Telugu Subtitle Dataset

Download Dataset

Hugging Face Dataset

from datasets import load_dataset
ds = load_dataset("pradeepannepu/telugu_subtitle")

Pandas

import pandas as pd
df = pd.read_csv("hf://datasets/pradeepannepu/telugu_subtitle/subtitles.csv")

Croissant

from mlcroissant import Dataset

ds = Dataset(jsonld="https://huggingface.co/api/datasets/pradeepannepu/telugu_subtitle/croissant")
records = ds.records("default")

Source

Subtitles metadata and files are obtained programmatically from the public API endpoints of AZSubtitles: https://www.azsubtitles.com
Full credit: AZSubtitles (AZSubtitles.com). This dataset derivation depends entirely on their publicly exposed API responses.

Script

Collector: scrape.py
Process:

  1. Enumerates all movies with Telugu subtitles via /api/search?lg=telugu
  2. Retrieves per-movie details via /api/movie/{UID}
  3. Filters for Telugu subtitles (Language.Title == "Telugu")
  4. Downloads the first available file URL for each matching subtitle
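Steps 3 and 4 can be sketched as a pure function over an already-decoded movie-detail response. This is only an illustration, not the logic in scrape.py: the field names `Subtitles`, `Files`, and `Url` are assumptions, and only `Language.Title` comes from the description above. The real response shape may differ.

```python
from typing import Any, Dict, List


def first_telugu_file_urls(movie: Dict[str, Any]) -> List[str]:
    """Return the first usable file URL of each Telugu subtitle in a movie record.

    Assumed (hypothetical) shape:
    {"Subtitles": [{"Language": {"Title": ...}, "Files": [{"Url": ...}]}]}
    """
    urls: List[str] = []
    for subtitle in movie.get("Subtitles") or []:
        language = subtitle.get("Language") or {}
        if language.get("Title") != "Telugu":
            continue  # step 3: keep Telugu entries only
        files = subtitle.get("Files") or []
        # step 4: take the first file that actually carries a URL
        first_url = next((f.get("Url") for f in files if f.get("Url")), None)
        if first_url:
            urls.append(first_url)
    return urls
```

Keeping the filter free of network calls makes it easy to check against a saved API response before running a full download.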

Directory

telugu_subtitles/
Contains raw downloaded subtitle archive files as returned by AZSubtitles. Filenames are sanitized (alphanumeric, space, dot, underscore, hyphen).
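The sanitization rule can be approximated as below. This is a minimal sketch, not the code in scrape.py. It assumes ASCII alphanumerics, that disallowed characters are dropped rather than replaced, and a hypothetical fallback name for results that end up empty.

```python
import re

# Anything outside: letters, digits, space, dot, underscore, hyphen.
_DISALLOWED = re.compile(r"[^A-Za-z0-9 ._-]")


def sanitize_filename(name: str, fallback: str = "subtitle") -> str:
    """Drop disallowed characters and return a safe filename."""
    cleaned = _DISALLOWED.sub("", name)
    # Trim surrounding spaces and dots so names like "." or ".." cannot survive.
    cleaned = cleaned.strip(" .")
    return cleaned or fallback
```

For example, `sanitize_filename("Baahubali: The Beginning (2015).srt")` yields `Baahubali The Beginning 2015.srt`.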

Intended Use

This dataset is intended for:

  • NLP experiments (tokenization, language modeling)
  • Subtitle format parsing research
  • Temporal alignment exploration

Do not redistribute subtitle contents without verifying permission.

Attribution Requirements

If you use these subtitles or derived data:

  • Cite AZSubtitles as the original source.
  • Provide a link: https://www.azsubtitles.com
  • Indicate that acquisition used their public API.

Suggested citation snippet: Data sourced from AZSubtitles (https://www.azsubtitles.com) via automated retrieval of publicly available Telugu subtitle listings.

Legal / Ethical Notice

  • Review AZSubtitles Terms of Service before large-scale use or redistribution.
  • Subtitles may be copyrighted works; confirm rights for downstream applications.
  • Remove any file upon request from rights holders.
  • This repository does not claim ownership of subtitle content.

Quality / Caveats

  • Some files may be compressed archives (.zip/.rar).
  • Encoding may vary (UTF-8, Windows-1252, etc.).
  • Duplicate or near-duplicate subtitles are possible.
  • No automatic language validation beyond API metadata.
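Because file extensions may not reflect the actual container, one way to triage downloads is to inspect the leading bytes. The sketch below is an illustration rather than part of scrape.py. It relies on the well-known ZIP and RAR signatures and treats everything else as a plain subtitle file.

```python
ZIP_MAGIC = (b"PK\x03\x04", b"PK\x05\x06")  # regular and empty ZIP archives
RAR_MAGIC = b"Rar!\x1a\x07"                 # shared prefix of RAR 4 and RAR 5


def sniff_container(data: bytes) -> str:
    """Classify a download as 'zip', 'rar', or 'plain' from its first bytes."""
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.startswith(RAR_MAGIC):
        return "rar"
    return "plain"
```

ZIP files can then be unpacked with the standard-library `zipfile` module. RAR files need a third-party tool.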

Preprocessing Suggestions

  • Decompress archives uniformly.
  • Normalize encoding to UTF-8.
  • Strip timestamps and formatting for pure text corpora.
  • Deduplicate by hash of cleaned content.
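The last three suggestions could look roughly like this for SubRip (.srt) content. It is a sketch under stated assumptions: files are tried as UTF-8 first with Windows-1252 as the only fallback, and hashing uses SHA-256 over whitespace-collapsed text. All function names are hypothetical.

```python
import hashlib
import re
from typing import Iterable, List

# SubRip timing line, e.g. "00:00:01,000 --> 00:00:02,500"
_TIMING = re.compile(
    r"^\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}\s*-->\s*\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}"
)
# HTML-like tags (<i>, <font ...>) and ASS-style overrides ({\an8})
_TAGS = re.compile(r"<[^>]+>|\{\\[^}]*\}")


def decode_subtitle(raw: bytes) -> str:
    """Decode as UTF-8 (dropping a BOM), else fall back to Windows-1252."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


def srt_to_text(srt: str) -> str:
    """Drop cue numbers, timing lines, and formatting tags; keep dialogue."""
    kept: List[str] = []
    for line in srt.splitlines():
        line = line.strip()
        # Limitation: a dialogue line made only of digits is also dropped.
        if not line or line.isdigit() or _TIMING.match(line):
            continue
        line = _TAGS.sub("", line).strip()
        if line:
            kept.append(line)
    return "\n".join(kept)


def content_hash(text: str) -> str:
    """SHA-256 of the text with all whitespace runs collapsed to one space."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(texts: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each distinct cleaned text."""
    seen = set()
    unique: List[str] = []
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Exact hashing only removes identical cleaned texts. Near-duplicates, such as re-timed copies with small edits, would need a fuzzy method like shingling or MinHash.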

Responsible Use

Use data respectfully. Avoid generating or distributing infringing derivative works.

Contact

Issues: open a repository issue describing the file and the concern. Removal requests are honored.
