This repository contains utilities and data developed to support the registered report *How Do Code Smells Affect Skill Growth in Scratch Novice Programmers?*

It includes Python scripts that automate and parallelize the processing of Scratch projects with Litterbox and fetch project titles and metadata from the Scratch API, together with shell and R helpers for collecting files and wrangling the results. The repository provides the following scripts:
- `197_BSsurvey_multiproceso_litterbox.py`
- `198_BSsurvey_get_multi_filenames.py`
- `extract_scratch_meta.py`
- `sb3_pick.sh`
- `pick_sb3.ps1`
- `wrangling_combining_RR.R`
## 197_BSsurvey_multiproceso_litterbox.py

A multiprocessing script to execute Litterbox in batch mode, reading a list of projects from a CSV file (one project per row; see the input example below). It supports automatic resumption, skipping failed projects, and consolidating results into a single CSV.
- Automatic Resumption: Skips projects already in `ok_projects.txt` and retries those in `failed_projects.txt` (a sketch of this logic follows the argument table below).
- Skip Failed Projects: Optionally ignore projects already marked as failed.
- No Temporary Files: Temporary files are always deleted after each run.
- State Files: Only `ok_projects.txt`, `failed_projects.txt`, and `last_project_processed.txt` are maintained.
- CSV per Worker: Option to write a single CSV per process.
- Consolidation: Generate a unified CSV at the end.
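The exact column layout is defined inside the script, but as an illustration the input could be as simple as one project ID per row (the header name here is hypothetical):

```csv
project_id
123456789
987654321
```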
```bash
python 197_BSsurvey_multiproceso_litterbox.py \
  --csv "path/to/projects.csv" \
  --jar "path/to/Litterbox-1.9.2.full.jar" \
  --results-dir "path/to/results" \
  --output-dir "path/to/output" \
  --logs-dir "path/to/logs" \
  --tmp-dir "path/to/tmp" \
  --state-dir "path/to/state" \
  --max-workers 4 --timeout 1620 \
  --single-csv-per-worker --consolidate --auto-resume
```

| Argument | Description |
|---|---|
| `--csv` | Path to the CSV file containing projects. |
| `--jar` | Path to the Litterbox JAR file. |
| `--results-dir` | Base directory for Litterbox results. |
| `--output-dir` | Directory to save output CSVs. |
| `--logs-dir` | Directory for log files. |
| `--tmp-dir` | Directory for temporary files. |
| `--state-dir` | Directory for state files. |
| `--java-bin` | Path to the Java binary (if not in `PATH`). |
| `--timeout` | Timeout per project (in seconds). |
| `--max-workers` | Maximum number of parallel processes. |
| `--retries` | Number of retries for failed projects. |
| `--resume-failed` | (Legacy) Process only projects in `failed_projects.txt`. |
| `--auto-resume` | Skip OK projects and retry failed ones. |
| `--skip-failed` | Skip projects already marked as failed. |
| `--single-csv-per-worker` | Write a single CSV per process. |
| `--consolidate` | Generate a unified CSV at the end. |
| `--dry-run` | Print what would be done without executing Java. |
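To make the resume behaviour concrete, here is a minimal sketch of how the state files could drive `--auto-resume` and `--skip-failed`, assuming one project identifier per line in each file (file names are taken from the feature list above; the script's actual logic may differ):

```python
from pathlib import Path

def load_ids(path):
    """Read one project ID per line; a missing file means an empty state."""
    p = Path(path)
    return set(p.read_text().split()) if p.exists() else set()

def pending_projects(all_ids, state_dir, skip_failed=False):
    """Return the IDs still to be processed under auto-resume semantics."""
    ok = load_ids(Path(state_dir) / "ok_projects.txt")
    failed = load_ids(Path(state_dir) / "failed_projects.txt")
    # Never redo OK projects; retry failed ones unless --skip-failed is set.
    done = ok | failed if skip_failed else ok
    return [pid for pid in all_ids if pid not in done]
```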
## 198_BSsurvey_get_multi_filenames.py

A script to download Scratch project titles in parallel, reading project IDs from a CSV file and writing or updating an output CSV. It avoids re-fetching IDs already present in the output file and supports retry logic with exponential backoff.
- Incremental Fetching: Only fetches IDs not already in the output CSV.
- Parallel Processing: Uses multiprocessing for efficiency.
- Retry Logic: Implements exponential backoff for failed requests (see the sketch after the variable table below).
- Rate Limiting: Respects Scratch API rate limits and includes a courtesy pause between requests.
```bash
python 198_BSsurvey_get_multi_filenames.py
```

Note: Adjust `INPUT_CSV` and `OUTPUT_CSV` paths in the script before running.
| Variable | Description |
|---|---|
| `INPUT_CSV` | Path to the input CSV file containing project IDs. |
| `OUTPUT_CSV` | Path to the output CSV file for titles. |
| `API_URL` | Scratch API endpoint for project metadata. |
| `TIMEOUT` | HTTP request timeout (in seconds). |
| `MAX_RETRIES` | Maximum number of retries per ID. |
| `BACKOFF_BASE` | Base delay for exponential backoff (in seconds). |
| `TOTAL_PAUSE` | Total desired pause between aggregated requests. |
| `MAX_WORKERS_ENV` | Optional environment variable to set the number of workers. |
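As a reference for the retry behaviour, here is a minimal fetch-with-backoff sketch, assuming the public endpoint `https://api.scratch.mit.edu/projects/{id}` (the script's actual `API_URL` and constants may differ):

```python
import time
import requests

API_URL = "https://api.scratch.mit.edu/projects/{}"  # assumed endpoint
TIMEOUT = 10        # seconds per HTTP request
MAX_RETRIES = 3
BACKOFF_BASE = 2.0  # seconds

def fetch_title(project_id):
    """Return the project title, retrying with exponential backoff on failure."""
    for attempt in range(MAX_RETRIES):
        try:
            resp = requests.get(API_URL.format(project_id), timeout=TIMEOUT)
            if resp.status_code == 200:
                return resp.json().get("title")
            if resp.status_code == 404:
                return None  # project does not exist; retrying will not help
        except requests.RequestException:
            pass  # network error: fall through to the backoff delay
        time.sleep(BACKOFF_BASE * (2 ** attempt))
    return None
```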
## extract_scratch_meta.py

A tool to extract metadata from downloaded Scratch `.sb3` files using the public Scratch API. It scans a folder, optionally recursively, and writes a CSV containing project titles, authors, dates, and remix lineage.
- Automatic ID Detection: Derives the project ID from each filename.
- Parallel Requests: Uses threads or processes for faster API calls.
- Robust Output: Stores errors alongside metadata for later review.
- Progress Bar: Reports progress via `tqdm`.
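For example, ID detection can be a simple pattern match; a sketch assuming filenames such as `123456789.sb3` (the script's actual rule may be more permissive):

```python
import re
from pathlib import Path

def project_id_from_filename(path):
    """Extract the numeric Scratch project ID from an .sb3 filename, or None."""
    match = re.search(r"(\d+)", Path(path).stem)
    return match.group(1) if match else None

print(project_id_from_filename("downloads/123456789.sb3"))  # -> "123456789"
```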
```bash
python extract_scratch_meta.py --input path/to/sb3_dir --output meta.csv --workers 4
```

## sb3_pick.sh

A Bash helper that gathers all `.sb3` files under a directory and packs them into a single ZIP archive while reporting counts and sizes.
- Recursive Search: Finds every `.sb3` file within the source directory.
- Portable Archive: Creates a timestamped ZIP by default.
- Summary Report: Prints totals for files, sizes, and elapsed time.
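The same gather-and-zip idea, sketched in Python for environments without Bash (the archive naming and summary format here are illustrative, not the script's exact output):

```python
import sys
import time
import zipfile
from pathlib import Path

def pack_sb3(src_dir, out_zip=None):
    """Collect every .sb3 file under src_dir and write it into one ZIP archive."""
    src = Path(src_dir)
    # Default to a timestamped archive name, mirroring the script's default.
    out = Path(out_zip) if out_zip else Path(f"sb3_{time.strftime('%Y%m%d_%H%M%S')}.zip")
    files = sorted(src.rglob("*.sb3"))  # recursive search
    total_bytes = sum(f.stat().st_size for f in files)
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            zf.write(f, f.relative_to(src))  # store paths relative to the source dir
    print(f"Packed {len(files)} files ({total_bytes} bytes) into {out}")

if __name__ == "__main__":
    pack_sb3(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else None)
```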
```bash
./sb3_pick.sh path/to/source [output.zip]
```

## pick_sb3.ps1

A PowerShell script that copies all `.sb3` files from the current directory tree into a destination folder.
- Destination Check: Creates the target directory if it does not exist.
- Verbose Copy: Logs each file copied for easy tracking.
```powershell
./pick_sb3.ps1 -Destino C:\ruta\destino
```

## wrangling_combining_RR.R

An R script employing the tidyverse to combine titles and analysis results from multiple CSV sources, removing duplicates and aligning column names.
- Dataset Merging: Joins Litterbox outputs with DrScratch results.
- Column Normalisation: Renames and filters fields for compatibility.
- Sample Method: Uses nearest neighbor matching.
- Export Helpers: Writes consolidated datasets back to CSV.
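The script itself relies on the tidyverse; purely as an illustration of the merge-and-deduplicate step, here is a hypothetical pandas equivalent (file and column names are invented):

```python
import pandas as pd

# Hypothetical inputs: titles fetched from the API and Litterbox analysis results.
titles = pd.read_csv("titles.csv")               # columns: project_id, title
results = pd.read_csv("litterbox_results.csv")   # columns: project_id, smell, count

combined = (
    results.merge(titles, on="project_id", how="left")  # attach titles to results
           .drop_duplicates()                           # remove duplicate rows
)
combined.to_csv("combined.csv", index=False)
```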
```bash
Rscript wrangling_combining_RR.R
```

## Requirements

- Python 3.9+
- Java (for Litterbox)
- R with the `tidyverse` package (for the data wrangling script)
- Required Python packages: `pandas`, `requests`, `tqdm`
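The Python dependencies can be installed with pip:

```bash
pip install pandas requests tqdm
```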
## License

This project is open source and available under the Creative Commons Zero v1.0 Universal (CC0) license.