A robust Python-based scraping tool designed to extract comprehensive university program data from YÖK Atlas. It captures rankings, quotas, and score requirements across all major Turkish university entrance exam score types.
Get everything ready and running in minutes:

```bash
# 1. Install dependencies
uv sync

# 2. Run the full pipeline (scrape all types + finalize + analytics)
python sync.py --headless
```

The project consists of several specialized scripts:
| Script | Purpose |
|---|---|
| `sync.py` | **Orchestrator:** Runs the full pipeline (scraping -> finalization -> analytics). |
| `main.py` | **Scraper:** The core engine using Selenium and BeautifulSoup. |
| `finalize.py` | **Processor:** Normalizes, cleans, and merges raw JSON files into `data.json`. |
| `analytics.py` | **Reporter:** Provides high-level statistics about the collected data. |
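The orchestration in `sync.py` can be pictured as a thin wrapper that runs the other scripts in order. A minimal sketch, assuming the flags are forwarded as shown (the exact arguments `sync.py` passes to `main.py` are an assumption):

```python
import subprocess
import sys


def build_commands(year: int, headless: bool = True) -> list[list[str]]:
    """Commands for each pipeline stage: scrape -> finalize -> analytics."""
    scrape = [sys.executable, "main.py", "--all-types", "--year", str(year)]
    if headless:
        scrape.append("--headless")
    return [
        scrape,
        [sys.executable, "finalize.py"],
        [sys.executable, "analytics.py"],
    ]


def run_pipeline(year: int, headless: bool = True) -> None:
    """Run the stages sequentially, stopping at the first failure."""
    for cmd in build_commands(year, headless):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Using `check=True` means a failed scrape aborts the run before `finalize.py` can merge incomplete data.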
Adding data for a new academic year is now a single-command process.
Use the sync.py script to update everything:
```bash
python sync.py --year 2026 --headless
```

If you want to scrape only specific score types:

```bash
python main.py --score-type say --year 2026 --output data_2026_say.json
```

> **Tip:** The scraper stores the year on each record. The deduplication key is `code:year`, so multiple years can coexist in the same file without conflict.
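The deduplication described above can be sketched as a dictionary merge keyed on `(code, year)` (the function name is illustrative, not taken from the codebase):

```python
def merge_records(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Merge record lists, deduplicating on the (code, year) key.

    Later records win, so re-scraping a year refreshes its rows
    without disturbing other years of the same program code.
    """
    by_key = {(r["code"], r["year"]): r for r in existing}
    for r in incoming:
        by_key[(r["code"], r["year"])] = r
    return list(by_key.values())
```

Because the key includes the year, a 2025 and a 2026 record for the same program code are kept side by side.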
- `--score-type {say,ea,soz,dil,tyt}`: Specific score type (default: `say`).
- `--output FILE`: Custom output path.
- `--headless`: Run without a browser window.
- `--year YEAR`: Specific year to target.
- `--all-types`: Scrape all types sequentially (with built-in delays).
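These flags map naturally onto `argparse`. A sketch of how the parser might be wired (the help strings and any defaults beyond `--score-type` are assumptions):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the CLI flags listed above."""
    p = argparse.ArgumentParser(description="YÖK Atlas program scraper")
    p.add_argument("--score-type", choices=["say", "ea", "soz", "dil", "tyt"],
                   default="say", help="score type to scrape (default: say)")
    p.add_argument("--output", metavar="FILE", help="custom output path")
    p.add_argument("--headless", action="store_true",
                   help="run Chrome without a browser window")
    p.add_argument("--year", type=int, help="specific year to target")
    p.add_argument("--all-types", action="store_true",
                   help="scrape all score types sequentially")
    return p
```

Note that `argparse` exposes `--score-type` as `args.score_type` and `--all-types` as `args.all_types`.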
- **Scrape:** `main.py` extracts data into `universities_data_{type}.json`.
- **Normalize:** `finalize.py` cleans up fields (like `"Doldu#"` -> `"Doldu"`) and merges the files into `data.json`.
- **Analyze:** `analytics.py` prints a summary of the dataset.
The final `data.json` contains records with the following structure:

```json
{
  "code": "203910830",
  "year": 2025,
  "university_name": "KOÇ ÜNİVERSİTESİ",
  "name": "Karşılaştırmalı Edebiyat",
  "attributes": ["İngilizce", "Burslu", "4 Yıllık"],
  "city": "İSTANBUL",
  "university_type": "Vakıf",
  "scholarship_type": "Burslu",
  "education_type": "Örgün",
  "total_quota": ["3+0", "3+0", "3+0", "3+0"],
  "quota_status": "Doldu",
  "filled_quota": ["3", "3", "3", "3"],
  "max_rank": ["215", "606", "516", "513"],
  "min_score": ["536,38093", "503,50496", "521,12754", "519,65975"],
  "score_type": "dil"
}
```

- Python 3.13+
- Google Chrome (required for Selenium)
- Dependencies: `selenium`, `beautifulsoup4`, `requests`, `lxml`, `webdriver-manager`
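Once finalized, the dataset can be consumed with nothing beyond the standard library. A minimal sketch, assuming `data.json` is a JSON array of records and using the comma decimal separator shown in the sample record (function names are illustrative):

```python
import json


def load_programs(path: str = "data.json") -> list[dict]:
    """Load the finalized dataset (assumed to be a JSON array of records)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def min_scores(record: dict) -> list[float]:
    """Parse 'min_score' strings, which use a comma as the decimal separator."""
    return [float(s.replace(",", ".")) for s in record["min_score"]]
```

For example, `min_scores` turns `"536,38093"` into the float `536.38093`, making the scores sortable and comparable.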
- ✅ Resume Capability: Skips already scraped programs if interrupted.
- ✅ Incremental Saving: Saves data after every page.
- ✅ Anti-Detection: Uses random user agents and realistic delays.
- ✅ Memory Efficient: Headless mode support for low resource usage.
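The resume and incremental-saving behaviors above could be implemented along these lines; this is a sketch under the assumption that output files are JSON arrays keyed by `(code, year)`, and the real logic in `main.py` may differ:

```python
import json
import os


def load_done_keys(path: str) -> set[tuple[str, int]]:
    """Keys of programs already on disk, so an interrupted run can skip them."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {(r["code"], r["year"]) for r in json.load(f)}


def save_incremental(path: str, records: list[dict]) -> None:
    """Rewrite the output after each scraped page, via a temp file so a
    crash mid-write cannot corrupt the existing data."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    os.replace(tmp, path)
```

Writing to a temporary file and then calling `os.replace` keeps the last good snapshot intact even if the scraper is killed mid-save.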