Skip to content

primeid/intelligenz

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Intelligenz Seddeler Archive

Archive and analysis tools for Norske Intelligenssedler, Norway's first newspaper (1763-1920s).

About

This project provides tools to programmatically download and analyze the complete digital collection of Norske Intelligenssedler from the Norwegian National Library.

Collection Statistics:

  • 4,270 digitized issues
  • Published: 1763-1920s
  • Location: Oslo
  • License: Public Domain / Creative Commons

Quick Start

# 1. Build corpus index of all 4,270 available issues
python3 scripts/build_corpus.py --output corpus_index.json

# 2. Download newspapers from a specific time period
python3 scripts/download.py --from-year 1768 --to-year 1770 --limit 10

# 3. Modernize old Norwegian text using AI (requires Ollama)
python3 scripts/modernize_text.py --input data/1768/ --output modernized/

LLM-Powered Text Modernization

This project includes AI-powered modernization of 18th-19th century Norwegian using two LLM options:

Option 1: Local AI with Ollama (Free, Private)

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull language model
ollama pull llama3.2:3b

# Modernize texts locally
python3 scripts/modernize_text.py --input data/ --output modernized/

Option 2: Cloud AI with OpenRouter (Best Quality)

# Set up API key (get from https://openrouter.ai/keys)
export OPENROUTER_API_KEY="your_key_here"
# Or: cp .env.example .env  # and add your key

# Modernize with Claude 3.5 Sonnet (highest quality)
python3 scripts/modernize_text.py --openrouter --input data/ --output modernized/

# Or use a different model
python3 scripts/modernize_text.py --openrouter --model "google/gemini-pro" --input data/

Features

  • OCR error correction - Fixes mistakes in old scanned text
  • Language modernization - Old Norwegian (1700s-1900s) → Modern bokmål
  • Automatic summarization - Brief summaries of each issue
  • Named entity extraction - People, places, and events
  • Two backends - Local (Ollama) or Cloud (OpenRouter)

Project Structure

intelligenz/
├── README.md                 # This file
├── WARP.md                   # Internal project documentation
├── shell.nix                 # Nix development environment
├── requirements.txt          # Python dependencies
├── scripts/
│   ├── build_corpus.py       # Fetch list of all issues
│   ├── download.py           # Download newspaper content
│   └── extract_data.py       # Extract and export data
└── data/                     # Downloaded newspaper issues
    └── {year}/{month}/{day}/{issue}.json

Usage Examples

Search for specific content

import dhlab as dh

# Create a corpus of all issues
corpus = dh.Corpus(
    doctype="digavis",
    title="Norske Intelligenssedler",
    from_year=1763,
    to_year=1920
)

# Find concordances for a word
results = corpus.conc(words="handel")

Analyze word frequency over time

from dhlab.api.dhlab_api import ngram_newspapers

# Get frequency of "handel" (trade) over time
freq = ngram_newspapers(
    word="handel",
    title="Norske Intelligenssedler"
)

Data Format

Each downloaded issue is stored as JSON:

{
  "urn": "URN:NBN:no-nb_digavis_norskeintelligenssedler_null_null_17680420_6_16_1",
  "title": "Norske Intelligenssedler",
  "date": "17680420",
  "year": 1768,
  "pages": 4,
  "text": "Full OCR text content...",
  "metadata": {
    "publisher": "...",
    "language": "Norsk NOR"
  }
}

API Documentation

Requirements

  • Nix package manager
  • Python 3.10+
  • Internet connection for API access

License

Code: MIT License Data: Public Domain / CC-BY-NC-ND (varies by issue date)

Acknowledgments

Data provided by the Norwegian National Library (Nasjonalbiblioteket) through their DHLAB API.

About

Archive of Norske Intelligenssedler (1763-1920s) - Norway's first newspaper. Digital preservation, OCR translation, and deep analysis of historical content.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors