- Requirements
  - 1.1. Virtual Environment
  - 1.2. Working Environment
- Scripts
- Data
  - 3.1. Download XML
  - 3.2. Download Documents
  - 3.3. Parse XML to Extract Metadata
  - 3.4. Preprocess Metadata
- Roadmap
## 1. Requirements

### 1.1. Virtual Environment

To ensure that the same requirements are met across different operating systems and machines, it is recommended to create a virtual environment. This can be set up with UV.
```bash
which uv || echo "UV not found"  # checks the UV installation
```
If UV is not installed, it can be installed as follows.
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Afterwards, the virtual environment can be created and activated.
```bash
uv venv .venv                # creates a virtual environment with the name ".venv"
source .venv/bin/activate    # activates the virtual environment
```
Then the required packages are installed. UV ensures that the exact versions are installed.
```bash
uv sync --all-extras  # installs exact versions
```

### 1.2. Working Environment

Before running any scripts on an HPC cluster, you need to configure your personal working directory:
- Copy the environment template:
  ```bash
  cp .env.local.template .env.local
  ```
- Edit `.env.local` to set your working directory:
  ```bash
  # Open in your preferred editor
  nano .env.local  # or vim .env.local
  ```
- Update the `PROJECT_ROOT` variable to point to your personal working directory:
  ```bash
  # Example for user "john.doe":
  PROJECT_ROOT=/sc/home/john.doe/pilotproject-automatic-metadata
  # Example for different mount point:
  PROJECT_ROOT=/home/username/projects/pilotproject-automatic-metadata
  ```
- Verify your configuration:
  ```bash
  source .env.local
  echo "Project root: $PROJECT_ROOT"
  ```
Note: The `.env.local` file is ignored by git, so your personal configuration won't be committed to the repository.
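How `PROJECT_ROOT` reaches the scripts depends on the repository's own tooling; the sketch below only illustrates one way a script could pick it up, assuming `.env.local` is either `source`d beforehand or parsed directly. The `load_env_file` helper and the derived paths are hypothetical, not part of the repository.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env.local") -> None:
    """Hypothetical helper: copy KEY=VALUE lines into os.environ, skipping comments and blanks."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())


load_env_file()
project_root = Path(os.environ["PROJECT_ROOT"])
xml_dir = project_root / "data" / "xml"  # e.g. where the XML exports would live
print(f"Project root: {project_root}")
```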
## 2. Scripts

All scripts are located in the `scripts` folder.

## 3. Data

### 3.1. Download XML

The `scripts/download_xml.py` script downloads XML export files from the Brandenburg parliament documentation website.
```bash
# Basic usage - downloads all WP 1-8 to data/xml/
python scripts/download_xml.py

# Custom wahlperiode selection
python scripts/download_xml.py --wp 1,2,5-8

# Custom output directory
python scripts/download_xml.py --wp 3-5 --output /tmp/xml
```
The script will:
- Download specified `exportWP*.xml` files from https://www.parlamentsdokumentation.brandenburg.de/portal/opendata.tt.html
- Save XMLs to specified output directory (default: `data/xml/`)
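The `--wp 1,2,5-8` selection implies expanding a comma/range expression into individual Wahlperioden before building the `exportWP*.xml` file names. The snippet below is a hedged sketch of that parsing step, not the script's actual implementation:

```python
def parse_wahlperioden(spec: str) -> list[int]:
    """Expand a selection such as "1,2,5-8" into [1, 2, 5, 6, 7, 8]."""
    periods: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            periods.update(range(start, end + 1))
        else:
            periods.add(int(part))
    return sorted(periods)


# File names for the selected Wahlperioden, following the exportWP*.xml naming pattern
filenames = [f"exportWP{wp}.xml" for wp in parse_wahlperioden("1,2,5-8")]
print(filenames)  # ['exportWP1.xml', 'exportWP2.xml', 'exportWP5.xml', ..., 'exportWP8.xml']
```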
### 3.2. Download Documents

The `scripts/download_documents.py` script extracts URLs from parliamentary XML exports and downloads all referenced PDF and DOCX documents. The script has been optimized with parallel processing for faster downloads.
```bash
# Basic usage - downloads all documents from XML files in data/xml/
python scripts/download_documents.py

# With custom parameters for faster processing
python scripts/download_documents.py --workers 16 --retries 5

# Dry run to see what would be downloaded
python scripts/download_documents.py --dry-run

# Custom directories and logging
python scripts/download_documents.py --xml-dir custom/xml --output-dir custom/docs --log-level DEBUG
```
For large-scale downloads on HPC clusters, use the SLURM batch script:
```bash
# Submit job with default settings
sbatch scripts/download_documents.sbatch
```
The script will:
- Process all `exportWP*.xml` files in the specified XML directory
- Extract unique document URLs from `<LokURL>` elements
- Download PDF, DOCX, DOC, and HTML files in parallel
- Save files to the specified output directory (default: `data/documents/`)
- Generate detailed logs in `logs/download_documents.log`
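As a rough illustration of the workflow described above (collect `<LokURL>` values, then download in parallel with retries), here is a minimal sketch assuming the `requests` library is available; it mirrors the directory names from the description but is not the script's actual code:

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests  # assumed to be available in the project environment


def collect_urls(xml_dir: str = "data/xml") -> set[str]:
    """Gather unique document URLs from <LokURL> elements of all exportWP*.xml files."""
    urls: set[str] = set()
    for xml_file in Path(xml_dir).glob("exportWP*.xml"):
        for el in ET.parse(xml_file).iter("LokURL"):
            if el.text:
                urls.add(el.text.strip())
    return urls


def download(url: str, out_dir: Path, retries: int = 3) -> None:
    """Download one document, retrying a few times on transient errors."""
    target = out_dir / url.rsplit("/", 1)[-1]
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=60)
            response.raise_for_status()
            target.write_bytes(response.content)
            return
        except requests.RequestException:
            if attempt == retries:
                raise


out_dir = Path("data/documents")
out_dir.mkdir(parents=True, exist_ok=True)
with ThreadPoolExecutor(max_workers=8) as pool:
    # list() forces iteration so download errors surface instead of being silently dropped
    list(pool.map(lambda url: download(url, out_dir), collect_urls()))
```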
### 3.3. Parse XML to Extract Metadata

The `scripts/parse_xml.py` script extracts comprehensive metadata from parliamentary XML exports.
```bash
# Parse all XML files and extract metadata
python scripts/parse_xml.py

# Custom input/output directories
python scripts/parse_xml.py --input data/xml --output data/metadata_complete
```
The script extracts:
- Document details (`DokArt`, `Titel`, `Datum`, `Nummer`)
- URLs for different file formats (PDF, DOCX, HTML)
- Speaker information (`Redner`)
- Procedural categories (`Desk`)
- Procedural text (`BText`)
- Complete Vorgang-Dokument hierarchy
### 3.4. Preprocess Metadata

The `scripts/preprocess_metadata.py` script handles duplicate references and creates ML-ready datasets.
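What "handling duplicate references" means in detail is defined by the script itself; as a rough, hypothetical sketch (the field names below are assumptions, not the actual metadata schema), duplicates could be collapsed by document URL before text/target pairs are built. The real usage examples follow after the sketch.

```python
# Hypothetical metadata records; field names are illustrative, not the script's schema
records = [
    {"pdf_url": "https://example.org/doc1.pdf", "vtyp": "Kleine Anfrage"},
    {"pdf_url": "https://example.org/doc1.pdf", "vtyp": "Kleine Anfrage"},  # duplicate reference
    {"pdf_url": "https://example.org/doc2.pdf", "vtyp": "Gesetzentwurf"},
]


def deduplicate(records: list[dict], key: str = "pdf_url") -> list[dict]:
    """Keep the first record per document URL; later references to the same file are dropped."""
    seen: set[str] = set()
    unique = []
    for rec in records:
        if rec[key] in seen:
            continue
        seen.add(rec[key])
        unique.append(rec)
    return unique


# (document, target) pairs for a chosen classification target, here VTyp
dataset = [(rec["pdf_url"], rec["vtyp"]) for rec in deduplicate(records)]
print(dataset)  # only the two unique documents remain
```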
```bash
# Process extracted metadata with default VTyp target
python scripts/preprocess_metadata.py

# Use different classification targets
python scripts/preprocess_metadata.py --target-field primary_dokart        # (imbalanced)
python scripts/preprocess_metadata.py --target-field derived_subject_area  # Subject-based classes
```

## 4. Roadmap

- Text Extraction Pipeline: PDF/DOCX/HTML text extraction
- Text Preprocessing: Clean and normalize German parliamentary text
- Long Document Handling: Chunking strategy for Llama 2 context limits
- Train/Validation/Test Splits: Stratified sampling by VTyp and Wahlperiode (see the sketch after this list)
- Final Training Format: Text-target pairs ready for model training
- Llama 2 7B + LoRA Setup: Configure for VTyp classification
- Distributed Training: Utilize up to 8 H100 GPUs with SLURM
- Model Training: Fine-tune for German parliamentary document classification
- Evaluation: Accuracy, F1-score, confusion matrix analysis
- Error Analysis: Identify challenging document types and failure modes
- Multi-Task Architecture: Predict VTyp + Titel + Desk + Redner simultaneously
- Comparative Analysis: Single-task vs. multi-task performance
- Advanced Targets: Subject area classification, document size prediction
- Ensemble Methods: Combine multiple models for improved accuracy
- Document Segmentation: Speaker-based and topic-based chunking
- Hierarchical Prediction: Segment-level → document-level aggregation
- Attention Mechanisms: Learn segment importance for document classification
- Production Pipeline: End-to-end metadata prediction system
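For the Train/Validation/Test Splits item above, stratifying by both VTyp and Wahlperiode can be done by combining the two fields into a single stratification key. A minimal sketch, assuming scikit-learn and pandas are available and using placeholder column names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split  # assumed dependency for this sketch

# Placeholder data: one row per document with "vtyp" and "wahlperiode" columns
df = pd.DataFrame({
    "text": ["..."] * 100,
    "vtyp": (["Kleine Anfrage"] * 2 + ["Gesetzentwurf"] * 2) * 25,
    "wahlperiode": [1, 2] * 50,
})

# Combine both fields into one key so every split preserves their joint distribution
strata = df["vtyp"].astype(str) + "_wp" + df["wahlperiode"].astype(str)

train_df, temp_df = train_test_split(df, test_size=0.3, stratify=strata, random_state=42)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=strata.loc[temp_df.index], random_state=42
)
print(len(train_df), len(val_df), len(test_df))  # roughly 70 / 15 / 15
```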
