ayakutt/yokatlas-scraper

🎓 Yokatlas University Scraper

A Python scraper that extracts university program data from YÖK Atlas. It captures rankings, quotas, and score requirements for each supported Turkish university entrance exam score type (say, ea, soz, dil, tyt).


🚀 Quick Start

Get everything ready and running in minutes:

# 1. Install dependencies
uv sync

# 2. Run the full pipeline (Scrape all types + Finalize + Analytics)
python sync.py --headless

🛠️ Components

The project consists of several specialized scripts:

Script        Purpose
sync.py       Orchestrator: Runs the full pipeline (scraping -> finalization -> analytics).
main.py       Scraper: The core engine using Selenium and BeautifulSoup.
finalize.py   Processor: Normalizes, cleans, and merges raw JSON files into data.json.
analytics.py  Reporter: Provides high-level statistics about the collected data.

📊 How to Add a New Year's Data

Adding data for a new academic year is now a single-command process.

The Easy Way (Recommended)

Use the sync.py script to update everything:

python sync.py --year 2026 --headless

The Granular Way

If you want to scrape specific score types only:

python main.py --score-type say --year 2026 --output data_2026_say.json

Tip

The scraper stores a year field on each record. The deduplication key is code:year, so records from multiple years can coexist in the same file without conflict.
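To illustrate the code:year key, here is a minimal sketch of how deduplication on that key could work. The helper names dedup_key and deduplicate are hypothetical, and the "later record wins" policy is an assumption; the actual implementation may differ.

```python
def dedup_key(record: dict) -> str:
    """Build the "code:year" key described in the tip above."""
    return f"{record['code']}:{record['year']}"


def deduplicate(records: list) -> list:
    """Keep one record per code:year key.

    Assumption: when two records share a key, the later one replaces
    the earlier one. The real scraper may resolve conflicts differently.
    """
    unique = {}
    for record in records:
        unique[dedup_key(record)] = record
    return list(unique.values())


records = [
    {"code": "203910830", "year": 2025, "quota_status": "Doldu"},
    {"code": "203910830", "year": 2026, "quota_status": "Doldu"},
    {"code": "203910830", "year": 2025, "quota_status": "Doldu"},  # same key as the first
]
print(len(deduplicate(records)))  # 2
```

Because the year is part of the key, the 2025 and 2026 records for the same program code are both kept.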


⚙️ Advanced Usage

main.py Options

  • --score-type {say,ea,soz,dil,tyt}: Specific score type (default: say).
  • --output FILE: Custom output path.
  • --headless: Run without a browser window.
  • --year YEAR: Specific year to target.
  • --all-types: Scrape all types sequentially (with built-in delays).
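For readers who want to script around these options, the flags above can be mirrored with argparse roughly as follows. This is a hypothetical reconstruction, not the parser from main.py: build_parser is an illustrative name, and treating --year as an integer is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented main.py flags."""
    parser = argparse.ArgumentParser(description="Scrape YÖK Atlas program data")
    parser.add_argument(
        "--score-type",
        choices=["say", "ea", "soz", "dil", "tyt"],
        default="say",  # documented default
    )
    parser.add_argument("--output", metavar="FILE")
    parser.add_argument("--headless", action="store_true")
    parser.add_argument("--year", type=int)  # assumption: parsed as an integer
    parser.add_argument("--all-types", action="store_true")
    return parser


args = build_parser().parse_args(["--score-type", "dil", "--year", "2026", "--headless"])
print(args.score_type, args.year, args.headless, args.all_types)  # dil 2026 True False
```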

Pipeline Flow

  1. Scrape: main.py extracts data into universities_data_{type}.json.
  2. Normalize: finalize.py cleans up fields (like "Doldu#" -> "Doldu") and merges files into data.json.
  3. Analyze: analytics.py prints a summary of the dataset.
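The normalize-and-merge step can be sketched as below. This is only an approximation of what finalize.py does: it assumes each raw file is a JSON array of records, handles only the "Doldu#" cleanup mentioned above, and uses the hypothetical names normalize and merge_files.

```python
import json
from pathlib import Path

SCORE_TYPES = ["say", "ea", "soz", "dil", "tyt"]


def normalize(record: dict) -> dict:
    """Return a copy with a trailing "#" stripped from quota_status."""
    cleaned = dict(record)
    status = cleaned.get("quota_status")
    if isinstance(status, str):
        cleaned["quota_status"] = status.rstrip("#").strip()
    return cleaned


def merge_files(input_dir: Path, output_path: Path) -> int:
    """Merge universities_data_{type}.json files into a single JSON file.

    Assumption: every raw file holds a JSON array of record objects.
    Missing files are skipped. Returns the number of merged records.
    """
    merged = []
    for score_type in SCORE_TYPES:
        path = input_dir / f"universities_data_{score_type}.json"
        if not path.exists():
            continue
        with path.open(encoding="utf-8") as fh:
            merged.extend(normalize(record) for record in json.load(fh))
    with output_path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps Turkish characters readable in the output
        json.dump(merged, fh, ensure_ascii=False, indent=2)
    return len(merged)
```

Calling merge_files(Path("."), Path("data.json")) would combine whichever per-type files exist in the current directory.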

📦 Data Format

The final data.json contains records with the following structure:

{
  "code": "203910830",
  "year": 2025,
  "university_name": "KOÇ ÜNİVERSİTESİ",
  "name": "Karşılaştırmalı Edebiyat",
  "attributes": ["İngilizce", "Burslu", "4 Yıllık"],
  "city": "İSTANBUL",
  "university_type": "Vakıf",
  "scholarship_type": "Burslu",
  "education_type": "Örgün",
  "total_quota": ["3+0", "3+0", "3+0", "3+0"],
  "quota_status": "Doldu",
  "filled_quota": ["3", "3", "3", "3"],
  "max_rank": ["215", "606", "516", "513"],
  "min_score": ["536,38093", "503,50496", "521,12754", "519,65975"],
  "score_type": "dil"
}
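Note that min_score values are strings with a comma as the decimal separator, and max_rank values are numeric strings. A minimal sketch for converting them when analyzing the data follows; parse_score and parse_rank are hypothetical helpers, and returning None for non-numeric values is an assumption about how missing entries might appear.

```python
from typing import Optional


def parse_score(value: str) -> Optional[float]:
    """Convert a score such as "536,38093" to a float."""
    try:
        return float(value.replace(",", "."))
    except ValueError:
        return None  # assumption: non-numeric placeholders become None


def parse_rank(value: str) -> Optional[int]:
    """Convert a rank such as "215" to an int."""
    try:
        return int(value)
    except ValueError:
        return None  # assumption: non-numeric placeholders become None


record = {
    "max_rank": ["215", "606", "516", "513"],
    "min_score": ["536,38093", "503,50496", "521,12754", "519,65975"],
}
scores = [parse_score(v) for v in record["min_score"]]
ranks = [parse_rank(v) for v in record["max_rank"]]
print(scores[0], ranks[0])  # 536.38093 215
```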

🔧 Prerequisites

  • Python 3.13+
  • Google Chrome (required for Selenium)
  • Dependencies: selenium, beautifulsoup4, requests, lxml, webdriver-manager

🛡️ Features

  • Resume Capability: Skips already scraped programs if interrupted.
  • Incremental Saving: Saves data after every page.
  • Anti-Detection: Uses random user agents and realistic delays.
  • Headless Mode: Runs Chrome without a visible window (--headless) to reduce resource usage.
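The resume and incremental-saving behavior can be approximated as follows. This is a sketch under stated assumptions, not the code in main.py: pages stands in for scraped result pages (each a list of records), the function names are hypothetical, and writing through a temporary file is an added safeguard that the original may not use.

```python
import json
import os
import tempfile
from pathlib import Path


def load_existing(path: Path) -> list:
    """Load previously saved records, or an empty list on a first run."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def save_atomic(path: Path, records: list) -> None:
    """Write to a temp file, then swap it in, so an interrupted run
    never leaves a half-written output file (an assumption, see above)."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False)
    os.replace(tmp_name, path)


def scrape_with_resume(path: Path, pages) -> list:
    """Append unseen records page by page, saving after every page."""
    records = load_existing(path)
    seen = {f"{r['code']}:{r['year']}" for r in records}
    for page in pages:
        for record in page:
            key = f"{record['code']}:{record['year']}"
            if key in seen:
                continue  # already scraped in an earlier run
            seen.add(key)
            records.append(record)
        save_atomic(path, records)  # incremental save after each page
    return records
```

Running scrape_with_resume a second time against the same output file adds only records whose code:year key is not already present.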

About

A Python scraper that extracts university program data from YÖK Atlas (yokatlas.yok.gov.tr), including rankings, quotas, and score requirements for all score types.
