-
Notifications
You must be signed in to change notification settings - Fork 4
Description
π° Assignment: OCR & Digital Analysis of a Historical Newspaper Page
Case Study: El Martillo (Chiclayo, 1903β1919)
Source: https://fuenteshistoricasdelperu.com/2020/12/06/el-martillo-chiclayo-1903-1919/
β° Deadline
December 6 β until 11:59 PM (local time)
Late submissions will not be accepted.
π¦ Repository Requirement (MANDATORY)
You must create your own GitHub repository for this assignment.
Your repo must contain:
README.md- The Python notebook
- The CSV dataset
- Your short report (Markdown)
- The image/PDF of the selected newspaper page
Name your repository:
el-martillo-ocr-[yourname]
π― Objective
Using Claude API (vision/OCR), digitize and analyze one single scanned page from the historical Peruvian newspaper El Martillo.
Your goal is to transform that page into structured data and produce a short exploratory insight.
π Required Data
Select ONE newspaper page from El Martillo (any year between 1903β1919).
Official source (required):
π https://fuenteshistoricasdelperu.com/2020/12/06/el-martillo-chiclayo-1903-1919/
Save your file as:
/data/el_martillo/page_01.png
π¦ Deliverables (all must be in your GitHub repo)
-
Python Notebook (
.ipynb)- Load the selected page
- Extract text with Claude API (vision)
- Structure the extracted output (CSV/JSON)
-
Structured Dataset (
.csv)
Required columns:dateissue_numberheadlinesectiontype(article / advertisement / other)text_excerpt
-
Short Report (
.md)- Explain why you selected the page
- Describe OCR challenges or distortions
- Include one simple chart
- Provide 2β3 brief insights
-
Raw media file:
- The scanned newspaper page you used (
.pngor.jpg)
- The scanned newspaper page you used (
π§ Tasks to Complete
- Choose and download one page from the newspaper.
- Run Claude OCR to extract titles, sections, and content.
- Normalize and clean the extracted text.
- Build a CSV file with structured information.
- Write a short summary of insights.
- Upload everything to your GitHub repository.
π Evaluation (20 points)
| Criterion | Points |
|---|---|
| Claude OCR extraction | 11 |
| Dataset structure & quality | 7 |
| Report clarity | 2 |