28 changes: 14 additions & 14 deletions webapp/data/languages/eo/SOURCES.md
@@ -2,25 +2,25 @@

## Sources

### 1. Existing Word List
- **Source**: wooorm/dictionaries (Hunspell)
- **URL**: https://github.com/wooorm/dictionaries
### Word list

### 2. FrequencyWords
- **URL**: https://github.com/hermitdave/FrequencyWords
- **License**: MIT (code), CC-BY-SA 4.0 (content)
- **Usage**: Frequency data for daily word ranking and supplement generation
The words are taken from [_ReVo (Reta Vortaro)_](https://github.com/revuloj/revo-fonto/). Derived forms built with common word endings are also generated and allowed. For example, the bare root _hor_ is not a valid word, but its forms _horoj_ (hours), _horon_ (hour \[as the direct object]), _horaj_ (hourly \[modifying a plural]), and _horan_ (hourly \[modifying a singular direct object]) are valid and, being 5 letters long, are allowed.
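
As a rough illustration of the derivation step described above, the following Python sketch attaches endings to a root and keeps only the 5-letter results. The function name `five_letter_forms` and the `ENDINGS` tuple are hypothetical: the ending list contains only the endings seen in the _hor_ example plus the bare _-o_ and _-a_, and the actual processing may use a different set.

```python
import unicodedata

# Assumed ending list (hypothetical): only the endings from the example
# above, plus the bare noun and adjective endings.
ENDINGS = ("o", "a", "oj", "on", "aj", "an")


def five_letter_forms(root, endings=ENDINGS, length=5):
    """Return the forms root + ending that are exactly `length` letters long."""
    forms = []
    for ending in endings:
        # NFC keeps letters such as "ĉ" or "ŭ" as a single code point,
        # so len() counts letters rather than combining marks.
        word = unicodedata.normalize("NFC", root + ending)
        if len(word) == length:
            forms.append(word)
    return forms


print(five_letter_forms("hor"))
```

With a 3-letter root such as _hor_, only the 2-letter endings produce 5-letter forms; with a 4-letter root, only the 1-letter endings do.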

### Frequency data

The frequency data is taken from the corpus [_La Tekstaro de Esperanto_](https://tekstaro.com/). Each occurrence is counted under its _exact form_, so different inflected forms of the same root are counted separately.

These data were compiled and processed by Haley Wakamatsu (@haleyhalcyon).

## Modifications

- `eo_daily_words.txt`: Top 2000 most common words from existing word list, ranked by OpenSubtitles frequency
- `eo_5words_supplement.txt`: 3268 additional valid 5-letter words from FrequencyWords corpus
All dictionaries come from the same source. The data were previously processed for [Intervorto](https://gitlab.com/haleyhalcyon/intervorto/-/blob/master/scripts/freq_tekstaro.tsv?ref_type=heads) and have been reprocessed here to contain only 5-letter words.
- `eo_daily_words.txt`: The 993 most frequent words
- `eo_5words.txt`: The 2755 most frequent words
- `eo_5words_supplement.txt`: The 4343 words that appear at least twice in the Tekstaro
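
The three lists above could be derived from a table of exact-form counts roughly as follows. This is a minimal sketch, not the script actually used: the function names, the in-memory `counts` mapping, and the alphabetical tie-break are assumptions, and the supplement is read literally as every 5-letter word with a count of at least 2.

```python
import unicodedata


def five_letter_counts(counts, length=5):
    """Keep only words of exactly `length` letters (after NFC normalization)."""
    kept = {}
    for word, count in counts.items():
        word = unicodedata.normalize("NFC", word)
        if len(word) == length:
            kept[word] = kept.get(word, 0) + count
    return kept


def build_lists(counts, daily_size=993, main_size=2755, min_count=2):
    """Split 5-letter words into the three lists by frequency."""
    kept = five_letter_counts(counts)
    # Most frequent first; ties broken alphabetically (an assumption).
    ranked = sorted(kept, key=lambda w: (-kept[w], w))
    return {
        "eo_daily_words.txt": ranked[:daily_size],
        "eo_5words.txt": ranked[:main_size],
        "eo_5words_supplement.txt": [w for w in ranked if kept[w] >= min_count],
    }


# Toy counts for illustration only, not real Tekstaro figures.
example = {"horoj": 5, "horon": 3, "domoj": 3, "akvon": 1, "kaj": 100}
for name, words in build_lists(example, daily_size=1, main_size=2).items():
    print(name, words)
```

The default sizes mirror the figures in the list above; with real data, the supplement size would follow from the counts rather than being set directly.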

Non-Esperanto words (proper names, English loanwords, words containing x/y/w) have been removed.

## License

The frequency-derived data in this directory is provided under **CC-BY-SA 4.0**, compatible with the FrequencyWords content license.

## Acknowledgments

- **wooorm/dictionaries** for the base word list
- **Hermit Dave** ([FrequencyWords](https://github.com/hermitdave/FrequencyWords)) for frequency data derived from OpenSubtitles