libris/swepub-annif: Configuration and trained models for Annif

Intended for use with https://github.com/libris/swepub-redux/

See https://github.com/NatLibFi/Annif for general Annif instructions and advice.

Install and run

Assuming https://github.com/astral-sh/uv is installed:

git clone https://github.com/libris/swepub-annif.git
cd swepub-annif
uv run python -m nltk.downloader punkt_tab
uv run annif run --port 8083 # then check http://localhost:8083
# Instead of `annif run` (for development only), you could use gunicorn, e.g.:
# gunicorn --workers 4 --threads 4 --worker-class gthread --bind 127.0.0.1:8083 "annif:create_app()"
# (...and put behind e.g. nginx in a production environment)

(For the Python package installation to work, you might need to install some system dependencies first, e.g. protobuf-compiler on Ubuntu.)

Visit http://localhost:8083 to try the Annif UI. You'll also find Swagger there.

You can also test Annif from the command line, e.g.:

echo 'Cardiac troponin I in healthy Norwegian Forest Cat, Birman and domestic shorthair cats, and in cats with hypertrophic cardiomyopathy' | uv run annif suggest swepub-en
2022-10-11T13:10:35.736Z INFO [omikuji::model] Loading model from data/projects/swepub-en/omikuji-model...
...
<https://id.kb.se/term/ssif/4> Agricultural and Veterinary sciences  0.8900570869445801
<https://id.kb.se/term/ssif/40303> Clinical Science  0.6352069973945618
<https://id.kb.se/term/ssif/403> Veterinary Science  0.4740253984928131
<https://id.kb.se/term/ssif/106> Biological Sciences (Medical to be 3 and Agricultural to be 4) 0.17030012607574463
...

(This is slow because the model has to be loaded every time you run `annif suggest`; for normal use, prefer the REST API.)
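For repeated queries, the REST API avoids reloading the model on every call. Below is a minimal sketch using only the Python standard library. It assumes the server was started with `annif run --port 8083` as shown above, and that the suggest endpoint follows Annif's v1 REST API (`POST /v1/projects/<project-id>/suggest` with a form-encoded `text` field); check the Swagger page on your own instance for the exact parameters. The helper names `build_suggest_request`, `parse_suggestions` and `suggest` are made up for this example.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8083"  # assumption: server started as shown above


def build_suggest_request(project_id, text, limit=5, base_url=BASE_URL):
    """Build a POST request for the suggest endpoint (form-encoded body)."""
    url = f"{base_url}/v1/projects/{urllib.parse.quote(project_id)}/suggest"
    body = urllib.parse.urlencode({"text": text, "limit": limit}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def parse_suggestions(payload):
    """Return (uri, label, score) tuples from a JSON response body.

    Assumes a body shaped like
    {"results": [{"uri": ..., "label": ..., "score": ...}, ...]}.
    """
    results = json.loads(payload).get("results", [])
    return [(r["uri"], r["label"], r["score"]) for r in results]


def suggest(project_id, text, limit=5):
    """Send the request and return parsed suggestions (needs a running server)."""
    request = build_suggest_request(project_id, text, limit)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_suggestions(response.read().decode("utf-8"))


# Example (requires the Annif server to be running):
# for uri, label, score in suggest("swepub-en", "Cardiac troponin I in cats"):
#     print(f"<{uri}>\t{label}\t{score}")
```

Building the request and parsing the response are kept separate from the network call so that each part can be checked without a running server.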

Update model

NOTE: this repo already contains a pre-trained model. The following is only needed if you want to update it (i.e., recreate it from scratch).

First, generate corpora. In the swepub-redux repo, using the swepub-redux venv:

uv sync
# For quick testing, replace 0 with something low (e.g. 10000).
# 0 = get an unlimited amount of records.
bash misc/create_tsv_sets.sh en 0 3 5 ~/annif-input
bash misc/create_tsv_sets.sh sv 0 3 5 ~/annif-input
cd ~/annif-input
tail -n 700000 training_en.tsv | gzip > training_en.tsv.gz
tail -n 700000 training_sv.tsv | gzip > training_sv.tsv.gz
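The `tail | gzip` step keeps only the last 700000 lines of each TSV file and compresses the result. On a system without those tools, the same step could be sketched in Python as follows; the function name `keep_last_lines_gz` is hypothetical, and the file names in the usage comment are the ones used above.

```python
import gzip
from collections import deque
from pathlib import Path


def keep_last_lines_gz(src, dst, max_lines=700_000):
    """Write the last `max_lines` lines of `src` to `dst`, gzip-compressed.

    Mirrors `tail -n max_lines src | gzip > dst`.
    Returns the number of lines written.
    """
    # A deque with maxlen drops the oldest line whenever a new one arrives,
    # so at most max_lines lines are held in memory at once.
    with open(src, "rb") as infile:
        tail = deque(infile, maxlen=max_lines)
    with gzip.open(dst, "wb") as outfile:
        outfile.writelines(tail)
    return len(tail)


# Example:
# base = Path.home() / "annif-input"
# keep_last_lines_gz(base / "training_en.tsv", base / "training_en.tsv.gz")
```

Lines are read and written as bytes, so the content is copied unchanged regardless of encoding.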

In this repo and its own venv, train Annif:

uv run annif load-vocab ssif ssif_terms.ttl
uv run annif train -j 0 swepub-en ~/annif-input/training_en.tsv.gz # multiple (and non-gz) files also OK
uv run annif train -j 0 swepub-sv ~/annif-input/training_sv.tsv.gz

Training swepub-en can take more than an hour if your SQLite database contains the entirety of Swepub.

Note that the omikuji-train.txt files are not needed to run the API and should NOT be committed to the repo (the English one in particular gets quite large).

Additionally, for practical reasons, we like to keep the latest models in this Git repo without having to use Git LFS or similar. To avoid exceeding GitHub limits, we use BFG Repo Cleaner to remove large files from history after updating the models (i.e., we forcibly rewrite the Git history):

java -jar ~/somewhere/bfg-1.14.0.jar --delete-files '{*.cbor,vectorizer}'
git reflog expire --expire=now --all && git gc --prune=now --aggressive
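Before committing updated models, it can help to see which files in the working tree are large. The sketch below is only an illustration: it assumes GitHub's per-file block at 100 MiB as the threshold (one of several GitHub limits), and the function name `find_large_files` is made up for this example rather than taken from the repo.

```python
from pathlib import Path

GITHUB_FILE_LIMIT = 100 * 1024 * 1024  # GitHub blocks files larger than 100 MiB


def find_large_files(root, limit_bytes=GITHUB_FILE_LIMIT):
    """Return (path, size) pairs for files under `root` larger than `limit_bytes`.

    Results are sorted largest first; anything inside .git is skipped.
    """
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > limit_bytes:
            hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Example: list everything over 50 MiB in the current checkout.
# for path, size in find_large_files(".", limit_bytes=50 * 1024 * 1024):
#     print(f"{size / 1024 / 1024:8.1f} MiB  {path}")
```

A lower threshold is also a quick way to spot a stray omikuji-train.txt before it is staged.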

(Re)generate ssif_terms.ttl

ssif_terms.csv was created from the Excel version of "Standard för svensk indelning av forskningsämnen 2025" found here.

ssif_terms.ttl is generated with:

python3 ssif_terms_to_skos_ttl.py ssif_terms.csv > ssif_terms.ttl
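
To give a sense of what such a conversion involves, here is a rough sketch that turns code/label rows into SKOS Turtle. It is not the repo's `ssif_terms_to_skos_ttl.py`. The column names `code` and `label_en`, the function names, and the exact triples emitted are assumptions for illustration; the real ssif_terms.csv layout may differ. The URI pattern is the one visible in the `annif suggest` output above, and the broader-term rule assumes SSIF codes have 1, 3 or 5 digits.

```python
import csv
import io

BASE_URI = "https://id.kb.se/term/ssif/"  # URI pattern seen in the suggest output


def broader_code(code):
    """Parent of a 3- or 5-digit code (403 -> 4, 40303 -> 403); None at top level."""
    if len(code) == 5:
        return code[:3]
    if len(code) == 3:
        return code[:1]
    return None


def rows_to_turtle(csv_text):
    """Convert CSV text with (hypothetical) columns `code,label_en` to SKOS Turtle."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .", ""]
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = row["code"].strip()
        # Escape backslashes and double quotes so the Turtle literal stays valid.
        label = row["label_en"].strip().replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f"<{BASE_URI}{code}> a skos:Concept ;")
        lines.append(f'    skos:notation "{code}" ;')
        parent = broader_code(code)
        if parent is not None:
            lines.append(f"    skos:broader <{BASE_URI}{parent}> ;")
        lines.append(f'    skos:prefLabel "{label}"@en .')
        lines.append("")
    return "\n".join(lines)


# Example:
# print(rows_to_turtle("code,label_en\n4,Agricultural and Veterinary sciences\n403,Veterinary Science\n"))
```

Emitting plain strings keeps the sketch dependency-free; a library such as rdflib would be the more robust choice for anything beyond an illustration.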
