feat: add Danish corpus #61

Open
JeppeKlitgaard wants to merge 1 commit into Apsu:master from JeppeKlitgaard:master

@JeppeKlitgaard
This adds a Danish corpus computed from Wortschatz data using the analyser found at https://github.com/JeppeKlitgaard/Corpora/ (see the recipe: https://github.com/JeppeKlitgaard/Corpora/blob/master/analyser/danish_recipe.json).

It contains the computed monograms, bigrams, trigrams, and words from 3 million sentences sourced from Wortschatz: 1 million from news, 1 million from the web, and 1 million from Wikipedia entries.
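For readers unfamiliar with these terms, here is a minimal sketch of how such counts could be computed from a list of sentences. It is not the analyser's actual code: the function names `count_ngrams` and `count_words` are hypothetical, and the choices to lowercase the text, count spaces as characters, and split words on whitespace are assumptions that the real recipe may not share.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count character n-grams of length n across an iterable of sentences.

    Assumption: text is lowercased and spaces count as characters.
    """
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        # Slide a window of width n over the sentence.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def count_words(sentences):
    """Count whitespace-separated, lowercased words (assumed tokenisation)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


# Tiny illustrative sample, not taken from the corpus.
sample = ["Jeg er glad", "Er du glad"]
monograms = count_ngrams(sample, 1)
bigrams = count_ngrams(sample, 2)
trigrams = count_ngrams(sample, 3)
words = count_words(sample)
```

With n set to 1, 2, and 3, the same sliding-window function yields the monogram, bigram, and trigram tables respectively.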

The words.json file can be either truncated or omitted, as it is quite large.
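If truncation is preferred, something along these lines could keep only the most frequent entries. This is a sketch under the assumption that words.json is a flat mapping from word to count, which has not been checked against the format cmini expects; `truncate_words` and the `keep` parameter are hypothetical names.

```python
import json


def truncate_words(word_counts, keep):
    """Return the `keep` most frequent entries of a word -> count mapping."""
    top = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)[:keep]
    # dict preserves insertion order, so the result stays sorted by count.
    return dict(top)


# Illustrative counts only, not taken from the corpus.
counts = {"og": 50, "i": 40, "det": 30, "hygge": 1}
small = truncate_words(counts, 2)

# ensure_ascii=False keeps Danish letters such as æ, ø, å readable in the file.
serialised = json.dumps(small, ensure_ascii=False)
```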

I can quickly generate these for other languages as well if there is interest.

Let me know if this is useful and whether I need to make changes to get this into cmini.

@f5b7
f5b7 commented Jan 29, 2025

> I can quickly generate these for other languages as well if there is interest.

German, please. 🙏❤
