Archive and analysis tools for Norske Intelligenssedler, Norway's first newspaper (1763-1920s).
This project provides tools to programmatically download and analyze the complete digital collection of Norske Intelligenssedler from the Norwegian National Library.
Collection Statistics:
- 4,270 digitized issues
- Published: 1763-1920s
- Location: Oslo
- License: Public Domain / Creative Commons
# 1. Build corpus index of all 4,270 available issues
python3 scripts/build_corpus.py --output corpus_index.json
# 2. Download newspapers from a specific time period
python3 scripts/download.py --from-year 1768 --to-year 1770 --limit 10
# 3. Modernize old Norwegian text using AI (requires Ollama)
python3 scripts/modernize_text.py --input data/1768/ --output modernized/This project includes AI-powered modernization of 18th-19th century Norwegian using two LLM options:
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull language model
ollama pull llama3.2:3b
# Modernize texts locally
python3 scripts/modernize_text.py --input data/ --output modernized/# Set up API key (get from https://openrouter.ai/keys)
export OPENROUTER_API_KEY="your_key_here"
# Or: cp .env.example .env # and add your key
# Modernize with Claude 3.5 Sonnet (highest quality)
python3 scripts/modernize_text.py --openrouter --input data/ --output modernized/
# Or use a different model
python3 scripts/modernize_text.py --openrouter --model "google/gemini-pro" --input data/- ✅ OCR error correction - Fixes mistakes in old scanned text
- ✅ Language modernization - Old Norwegian (1700s-1900s) → Modern bokmål
- ✅ Automatic summarization - Brief summaries of each issue
- ✅ Named entity extraction - People, places, and events
- ✅ Two backends - Local (Ollama) or Cloud (OpenRouter)
intelligenz/
├── README.md # This file
├── WARP.md # Internal project documentation
├── shell.nix # Nix development environment
├── requirements.txt # Python dependencies
├── scripts/
│ ├── build_corpus.py # Fetch list of all issues
│ ├── download.py # Download newspaper content
│ └── extract_data.py # Extract and export data
└── data/ # Downloaded newspaper issues
└── {year}/{month}/{day}/{issue}.json
import dhlab as dh
# Create a corpus of all issues
corpus = dh.Corpus(
doctype="digavis",
title="Norske Intelligenssedler",
from_year=1763,
to_year=1920
)
# Find concordances for a word
results = corpus.conc(words="handel")from dhlab.api.dhlab_api import ngram_newspapers
# Get frequency of "handel" (trade) over time
freq = ngram_newspapers(
word="handel",
title="Norske Intelligenssedler"
)Each downloaded issue is stored as JSON:
{
"urn": "URN:NBN:no-nb_digavis_norskeintelligenssedler_null_null_17680420_6_16_1",
"title": "Norske Intelligenssedler",
"date": "17680420",
"year": 1768,
"pages": 4,
"text": "Full OCR text content...",
"metadata": {
"publisher": "...",
"language": "Norsk NOR"
}
}- DHLAB Python Library: https://nationallibraryofnorway.github.io/DHLAB/
- NB Catalog API: https://api.nb.no/catalog/v1/
- DHLAB API: https://api.nb.no/dhlab/
- Nix package manager
- Python 3.10+
- Internet connection for API access
Code: MIT License Data: Public Domain / CC-BY-NC-ND (varies by issue date)
Data provided by the Norwegian National Library (Nasjonalbiblioteket) through their DHLAB API.