Skip to content

Homework_6Β #161

@The-Paul2002

Description

@The-Paul2002

πŸ“° Assignment: OCR & Digital Analysis of a Historical Newspaper Page

Case Study: El Martillo (Chiclayo, 1903–1919)

Source: https://fuenteshistoricasdelperu.com/2020/12/06/el-martillo-chiclayo-1903-1919/


⏰ Deadline

December 6 β€” until 11:59 PM (local time)
Late submissions will not be accepted.


πŸ“¦ Repository Requirement (MANDATORY)

You must create your own GitHub repository for this assignment.
Your repo must contain:

  • README.md
  • The Python notebook
  • The CSV dataset
  • Your short report (Markdown)
  • The image/PDF of the selected newspaper page

Name your repository:
el-martillo-ocr-[yourname]


🎯 Objective

Using Claude API (vision/OCR), digitize and analyze one single scanned page from the historical Peruvian newspaper El Martillo.
Your goal is to transform that page into structured data and produce a short exploratory insight.


πŸ“‚ Required Data

Select ONE newspaper page from El Martillo (any year between 1903–1919).
Official source (required):
πŸ”— https://fuenteshistoricasdelperu.com/2020/12/06/el-martillo-chiclayo-1903-1919/

Save your file as:
/data/el_martillo/page_01.png


πŸ“¦ Deliverables (all must be in your GitHub repo)

  1. Python Notebook (.ipynb)

    • Load the selected page
    • Extract text with Claude API (vision)
    • Structure the extracted output (CSV/JSON)
  2. Structured Dataset (.csv)
    Required columns:

    • date
    • issue_number
    • headline
    • section
    • type (article / advertisement / other)
    • text_excerpt
  3. Short Report (.md)

    • Explain why you selected the page
    • Describe OCR challenges or distortions
    • Include one simple chart
    • Provide 2–3 brief insights
  4. Raw media file:

    • The scanned newspaper page you used (.png or .jpg)

🧠 Tasks to Complete

  1. Choose and download one page from the newspaper.
  2. Run Claude OCR to extract titles, sections, and content.
  3. Normalize and clean the extracted text.
  4. Build a CSV file with structured information.
  5. Write a short summary of insights.
  6. Upload everything to your GitHub repository.

πŸ“‹ Evaluation (20 points)

Criterion Points
Claude OCR extraction 11
Dataset structure & quality 7
Report clarity 2

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions